Commits · d64182ea82d89bdb9cec0da3fb88bf870a9f448a · wangsen / MinerU

27 Feb, 2025 1 commit
- Update version.py with new version · d64182ea
  myhloli authored Feb 27, 2025
  
  d64182ea
26 Feb, 2025 3 commits

fix: match multiple captions · 15cd97ff
icecraft authored Feb 26, 2025

15cd97ff

refactor(magic_pdf): simplify device selection in model initialization · 0a246f0f

myhloli authored Feb 26, 2025

- Replace complex device selection logic with a single line using torch.device
- Remove redundant checks and imports for better readability and maintainability

0a246f0f

refactor(magic_pdf): remove bfloat16 support checks and usage · 9b00f988

myhloli authored Feb 26, 2025

- Remove supports_bfloat16 variable and related checks
- Remove model.bfloat16() call for LayoutLMv3ForTokenClassification
- Simplify device selection logic

9b00f988

25 Feb, 2025 2 commits

feat(ocr_mkcontent): add full-width to half-width character conversion · 315adbce

myhloli authored Feb 25, 2025

- Implement full_to_half function to convert full-width characters to half-width
- Apply conversion to span content before merging paragraphs
- Improve text processing for better readability and consistency
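
The conversion this commit describes can be sketched as below. This is a minimal illustration, not the code from the commit: only the name `full_to_half` comes from the commit message. The sketch assumes the standard Unicode layout, in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts and the ideographic space U+3000 maps to a regular space. Which characters the real function converts is not stated in the log.

```python
def full_to_half(text: str) -> str:
    """Convert full-width characters to their half-width (ASCII) equivalents."""
    result = []
    for char in text:
        code = ord(char)
        if code == 0x3000:
            # Ideographic (full-width) space becomes a regular space.
            result.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants are offset from ASCII by 0xFEE0.
            result.append(chr(code - 0xFEE0))
        else:
            # Everything else (CJK text, existing ASCII) is left untouched.
            result.append(char)
    return "".join(result)


converted = full_to_half("ＡＢＣ１２３　ｘ＝１")
```

Applying such a conversion to span content before paragraphs are merged means later text processing sees a single form of each character.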

315adbce

perf(model): optimize batch analyze process · 6753df8d

myhloli authored Feb 25, 2025

- Move batch model initialization outside the loop
- Collect page dimensions before analyzing
- Update page info dictionary structure
- Add null dimensions for non-analyzed pages

6753df8d

24 Feb, 2025 3 commits

feat(pre_proc): add block type compatibility check for span allocation · 19916856

myhloli authored Feb 24, 2025

- Introduce span_block_type_compatible function to check compatibility between span and block types
- Update fill_spans_in_blocks function to use the new compatibility check
- Improve accuracy of span allocation to blocks based on content type

19916856

fix(llm_aided): update prompt · 9e332f06
myhloli authored Feb 24, 2025

9e332f06

fix(magic_pdf): correct negative indexing for `end_page_id` · 90a27ecd

myhloli authored Feb 24, 2025

- Update the logic for determining `end_page_id` to handle negative values
- This change ensures proper behavior when `end_page_id` is set to -1 or other negative values
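
One way to handle a negative `end_page_id` is sketched below. The helper name `resolve_end_page_id` and the `page_count` parameter are hypothetical, and the semantics are an assumption: the sketch treats negative values the way Python sequence indexing does, so that -1 means the last page. The commit may resolve them differently.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Return a valid zero-based index of the last page to process.

    Assumes page_count >= 1.
    """
    last = page_count - 1
    if end_page_id is None:
        # No limit given: process through the final page.
        return last
    if end_page_id < 0:
        # Assumed semantics: count back from the end, so -1 is the last page.
        end_page_id = page_count + end_page_id
    # Clamp so out-of-range requests still yield a usable index.
    return max(0, min(end_page_id, last))
```

With a 10-page document, `resolve_end_page_id(-1, 10)` yields 9 under these assumptions.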

90a27ecd

23 Feb, 2025 1 commit

chore(magic_pdf): enhance license logging information · 3fe315d8

myhloli authored Feb 23, 2025

- Add license ID information to the log for better traceability
- Improve logging format to include both license ID and expiration date

3fe315d8

22 Feb, 2025 1 commit
- fix doc_analyze first page only · 37f3e200
  Nathan Dahlberg authored Feb 22, 2025
  
  37f3e200
21 Feb, 2025 3 commits

fix(model): handle import errors and improve exception logging · 66f0899a

myhloli authored Feb 21, 2025

- Add ImportError handling to silence known import-related exceptions
- Improve generic exception handling to log error messages
- Maintain existing specific exception handlers for license-related issues

66f0899a

feat(model_init): implement license verification for Ascend plugin · d5f6fbc6

myhloli authored Feb 21, 2025

- Add license verification logic for Ascend plugin
- Handle different license-related exceptions with appropriate error messages
- Log success message with license expiration date if verification passes
- Fall back to CPU model if license verification fails or plugin is not available

d5f6fbc6

refactor(magic_pdf): improve title optimization process · 54940c61

myhloli authored Feb 21, 2025

- Update instructions for AI-generated titles optimization
- Use ast.literal_eval() instead of json.loads() for parsing completion content
- Refactor variable names and logging for better code readability
- Add error handling for JSON decoding issues
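
The switch from `json.loads()` to `ast.literal_eval()` matters when a model reply looks like a Python literal rather than strict JSON, for example when it uses single-quoted strings. A minimal sketch of that difference follows. The helper name `parse_completion` and the sample reply are hypothetical and are not taken from the commit.

```python
import ast
import json


def parse_completion(content):
    """Parse a completion as a Python literal; return None if it is malformed."""
    try:
        return ast.literal_eval(content.strip())
    except (ValueError, SyntaxError):
        # literal_eval raises ValueError for non-literal nodes and
        # SyntaxError for text that is not valid Python at all.
        return None


# A reply with single-quoted keys: valid Python literal, invalid JSON.
reply = "{'0': 1, '1': 2}"

try:
    json.loads(reply)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

parsed = parse_completion(reply)
```

The trade-off is that `ast.literal_eval()` rejects JSON-only tokens such as `true` and `null`, so the choice depends on what the model tends to emit.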

54940c61

18 Feb, 2025 3 commits
- fix: update figure caption match algorithm · f731fcab
  icecraft authored Feb 18, 2025
  
  f731fcab
- fix: update figure caption match algorithm · 0793da41
  icecraft authored Feb 18, 2025
  
  0793da41
- fix: caption match algorithm · daf0593b
  icecraft authored Feb 18, 2025
  
  daf0593b
14 Feb, 2025 1 commit
- fix(pdf_parse): Fixed the issue where some headings were missing in certain complex layouts. · 30bd3a83
  myhloli authored Feb 14, 2025
  
  30bd3a83
11 Feb, 2025 2 commits

fix(model): move environment variable settings to global scope · f5112e21

myhloli authored Feb 11, 2025

- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This change ensures consistent configuration across the application and avoids potential conflicts or duplicate settings

f5112e21

refactor(magic_pdf): improve code structure and memory safety · 4021abeb
myhloli authored Feb 11, 2025

4021abeb

10 Feb, 2025 2 commits

refactor(model_init): adjust table model import order and remove redundant imports · 4c0af020

myhloli authored Feb 10, 2025

- Remove redundant imports for StructTableModel and TableMasterPaddleModel
- Reorder imports to group related modules together
- Update import structure for better readability and maintainability

4c0af020

refactor(model): integrate Ascend plugin for NPU support · 7c76d361

myhloli authored Feb 10, 2025

- Remove unused utility functions
- Update import statements for better readability
- Add conditional imports for Ascend plugin
- Refactor table model initialization to support NPU

7c76d361

09 Feb, 2025 4 commits
- fix(pdf_parse): improve image processing and OCR accuracy · 5561ac95
  myhloli authored Feb 09, 2025
  - Update calculate_contrast function to support both RGB and BGR image modes
  - Add input validation for image mode in calculate_contrast function
  - Modify usage of calculate_contrast function in OCR processing to specify image mode
  5561ac95
- perf(language_detection): optimize batch size for language detection model · e4e4eef1
  myhloli authored Feb 09, 2025
  - Increase batch size from 8 to 256 for language detection inference
  - Add timing measurement for language detection process
  e4e4eef1
- fix(filter): toggle invalid character detection method · a5342950
  myhloli authored Feb 09, 2025
  
  a5342950
- refactor(filter): remove unused text layout analysis for PDF classification · f35a6c08
  myhloli authored Feb 09, 2025
  
  f35a6c08
08 Feb, 2025 2 commits

feat(pdf_parse): improve OCR processing and contrast filtering · 9f18ca20

myhloli authored Feb 08, 2025

- Rename empty_spans to need_ocr_spans for better clarity
- Add calculate_contrast function to measure image contrast
- Filter out low-contrast spans to improve OCR accuracy
- Update OCR processing workflow to use new filtering method
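
A contrast filter of this kind can be sketched in pure Python as follows. Only the name `calculate_contrast` comes from the log. The contrast formula (standard deviation of luminance divided by its mean), the `img_mode` parameter, the span layout with a `pixels` key, the helper `filter_low_contrast`, and the `threshold` value are all assumptions made for illustration. The project's own implementation may use an image library and a different measure.

```python
import math


def calculate_contrast(pixels, img_mode="rgb"):
    """Return std/mean of luminance for a flat list of 3-channel pixels."""
    if img_mode not in ("rgb", "bgr"):
        raise ValueError(f"unsupported image mode: {img_mode!r}")
    if not pixels:
        return 0.0
    grays = []
    for px in pixels:
        # Reverse the channel order for BGR input so weights line up.
        r, g, b = px if img_mode == "rgb" else px[::-1]
        grays.append(0.299 * r + 0.587 * g + 0.114 * b)
    mean = sum(grays) / len(grays)
    std = math.sqrt(sum((v - mean) ** 2 for v in grays) / len(grays))
    # Small epsilon avoids division by zero on an all-black crop.
    return std / (mean + 1e-6)


def filter_low_contrast(spans, threshold=0.1, img_mode="rgb"):
    """Keep only spans whose crop has contrast above the (assumed) threshold."""
    return [
        span for span in spans
        if calculate_contrast(span["pixels"], img_mode) > threshold
    ]
```

Dropping nearly uniform crops before OCR avoids spending recognition time on regions that are unlikely to contain legible text.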

9f18ca20

refactor(magic_pdf): update invalid character detection logic · 5aa809ff
myhloli authored Feb 08, 2025

- Uncomment detect_invalid_chars_by_pymupdf function call
- Comment out detect_invalid_chars function call

5aa809ff

07 Feb, 2025 1 commit

perf(model): optimize batch ratio for different GPU memory sizes · b1ac7afd

myhloli authored Feb 07, 2025

- Update batch ratio calculation logic to better utilize available GPU memory
- Improve logging for all GPU memory sizes

b1ac7afd

27 Jan, 2025 2 commits
- perf(model): adjust batch ratio for different GPU memory sizes · 29e7a948
  myhloli authored Jan 27, 2025
  
  29e7a948
- perf(model): adjust batch ratio for GPU memory range · d1af4566
  myhloli authored Jan 27, 2025
  - Update batch ratio calculation for GPU memory range
  - Raise the upper GPU memory limit for batch ratio 16 from 24 GB to 32 GB
  d1af4566
23 Jan, 2025 1 commit
- Update version.py with new version · 4211c74c
  myhloli authored Jan 23, 2025
  
  4211c74c
22 Jan, 2025 3 commits

feat(pdf_parse_union_core_v2): add timing log for LLM aided processes · 10e848b3

myhloli authored Jan 22, 2025

- Add timing measurement for formula, text, and title optimization using LLM
- Log the execution time for each LLM aided process

10e848b3

fix(boxbase): handle cases where bounding box area is zero · c38060d5

myhloli authored Jan 22, 2025

- Add a check to return 0 when either bbox1_area or bbox2_area is zero
- This prevents division by zero errors when calculating IoU
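
The guard described above can be sketched as follows. The variable names `bbox1_area` and `bbox2_area` come from the commit message. The function name `calculate_iou` and the `(x0, y0, x1, y1)` box layout are assumptions for illustration.

```python
def calculate_iou(bbox1, bbox2):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0  # boxes do not overlap

    intersection = (x_right - x_left) * (y_bottom - y_top)
    bbox1_area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    bbox2_area = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

    # Degenerate boxes could make the union zero, so bail out early.
    if bbox1_area == 0 or bbox2_area == 0:
        return 0

    return intersection / (bbox1_area + bbox2_area - intersection)
```

Without the zero-area check, two identical degenerate boxes would give a union of zero and raise `ZeroDivisionError`.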

c38060d5

refactor(pdf_parse): uncomment char bbox validation logic · 1d08865f

myhloli authored Jan 22, 2025

- Restore commented code for filtering out characters with invalid bounding boxes
- This change may affect the filtering of unnecessary characters in PDF parsing

1d08865f

21 Jan, 2025 5 commits

fix(magic_pdf): correct batch ratio conditions for GPU memory · b6710b99

myhloli authored Jan 21, 2025

- Update conditions for batch ratio assignment:
  - 8 <= gpu_memory < 10: batch_ratio = 2
  - 10 <= gpu_memory <= 12: batch_ratio = 4
- This fix ensures proper batch ratio selection for GPU memory sizes
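
The two conditions listed in this commit can be written as the sketch below. The function name `batch_ratio_for` is hypothetical. The log does not give the ratios for memory sizes outside these two ranges, so the sketch returns `None` for them instead of guessing.

```python
def batch_ratio_for(gpu_memory):
    """Map GPU memory in GB to a batch ratio for the two ranges in the commit."""
    if 8 <= gpu_memory < 10:
        return 2
    if 10 <= gpu_memory <= 12:
        return 4
    # Ranges outside 8 to 12 GB are not specified in this commit message.
    return None
```

Note the boundaries: 10 GB falls into the second range, and 12 GB is included in it.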

b6710b99

perf(magic_pdf): optimize batch processing for GPU · 55447c8b

myhloli authored Jan 21, 2025

- Improve batch ratio calculation based on GPU memory
- Enhance performance for devices with 8GB or more VRAM

55447c8b

perf(magic_pdf): adjust batch ratio calculation for GPU memory · 037736fb

myhloli authored Jan 21, 2025

- Reduce batch_ratio by 1 for better performance and stability
- This change ensures more consistent memory usage when processing documents

037736fb

refactor(magic_pdf): adjust VRAM allocation and MFR batch size · e74a2960

myhloli authored Jan 21, 2025

- Update VRAM allocation logic to use 'VIRTUAL_VRAM_SIZE' environment variable
- Reduce MFR (Math Formula Recognition) batch size from 64 to 32
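
An environment-variable override of this kind might look like the sketch below. Only the variable name `VIRTUAL_VRAM_SIZE` comes from the commit. The helper `get_vram_gb`, its parameters, and the assumption that the value is a whole number of gigabytes are illustrative and may not match the project's code.

```python
import os


def get_vram_gb(detected_gb, environ=None):
    """Return the VRAM size to plan for, preferring VIRTUAL_VRAM_SIZE if set."""
    env = os.environ if environ is None else environ
    override = env.get("VIRTUAL_VRAM_SIZE")
    if override is not None:
        try:
            # Assumed format: an integer number of gigabytes.
            return int(override)
        except ValueError:
            # Ignore an unparsable override and fall back to detection.
            pass
    return detected_gb
```

Passing the environment as a parameter keeps the helper easy to test, for example `get_vram_gb(24, {"VIRTUAL_VRAM_SIZE": "8"})`.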

e74a2960

perf(magic_pdf): optimize batch ratio calculation for GPU · 052a4d72

myhloli authored Jan 21, 2025

- Update GPU memory check and batch ratio calculation logic
- Add support for virtual VRAM size environment variable
- Improve logging for GPU memory and batch ratio

052a4d72