Commits · 4da3c0f5c0f97d90ecda3567502a6c91ef660ce7 · wangsen / MinerU

04 Mar, 2025 2 commits
- Update version.py with new version · 4da3c0f5
  myhloli authored Mar 04, 2025
  
  4da3c0f5
- refactor(magic_pdf): improve paragraph splitting logic and update dependencies · 842483cc
  myhloli authored Mar 04, 2025
```
- Optimize paragraph splitting algorithm for better text block separation
- Update fast-langdetect dependency to ensure compatibility
```
  842483cc
03 Mar, 2025 9 commits

Update version.py with new version · da0c2eaa
myhloli authored Mar 03, 2025

da0c2eaa

perf(inference): adjust batch ratio for high GPU memory · 0b05dff7

myhloli authored Mar 03, 2025

- Increase batch ratio to 8 for GPU memory >=16GB
- Improve inference performance on systems with higher GPU memory

0b05dff7

refactor(pre_proc): allow interline equations to be associated with text blocks · 083b787c

myhloli authored Mar 03, 2025

- Update OCR dictionary merge logic to include text blocks when processing interline equations
- This change improves the handling of equations that may be embedded within text content

083b787c

fix: caption match · fb02be19
icecraft authored Mar 03, 2025

fb02be19

perf(inference): adjust batch ratio for GPU memory sizes · 58b6ad8c

myhloli authored Mar 03, 2025

- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

58b6ad8c

perf(inference): adjust batch ratio for GPU memory sizes · 0d3304d7

myhloli authored Mar 03, 2025

- Simplify batch ratio logic for GPU memory >= 16GB
- Remove unnecessary conditions for 20GB and 40GB memory

0d3304d7

perf(mfr): improve Math Formula Recognition by sorting images by area · 59fc80d4

myhloli authored Mar 03, 2025

- Sort detected images by area before processing to enhance MFR accuracy
- Implement stable sorting to maintain original order of images with equal

59fc80d4
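
The sort-then-restore idea in this commit can be sketched as follows. This is a minimal illustration, not the project's code: `sort_by_area` and `restore_order` are hypothetical names, and images are stood in for by `(width, height)` tuples. It relies on the fact that Python's `sorted()` is stable, so items with equal keys keep their original relative order.

```python
def sort_by_area(images):
    """Return (original_index, image) pairs ordered by area, ascending.

    `images` is assumed to be a list of (width, height) tuples standing in
    for cropped formula images. sorted() is stable, so equal-area items
    keep their original relative order.
    """
    indexed = list(enumerate(images))
    return sorted(indexed, key=lambda pair: pair[1][0] * pair[1][1])


def restore_order(sorted_pairs, results):
    """Map results computed in sorted order back to the original positions."""
    restored = [None] * len(results)
    for (original_index, _), result in zip(sorted_pairs, results):
        restored[original_index] = result
    return restored
```

Keeping the original index next to each image lets a batch be processed in area order while the caller still receives results in the order the images were detected.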

refactor(pdf_parse): comment out performance measurement and logging · 6bfc1711

myhloli authored Mar 03, 2025

- Comment out @measure_time decorator for txt_spans_extract_v2 and sort_lines_by_model functions
- Remove logger.info for page_process_time
- Comment out PerformanceStats.print_stats call

6bfc1711

feat(performance): add performance monitoring and optimization · e516cf53

myhloli authored Mar 03, 2025

- Add performance_stats module to measure and print execution time statistics
- Implement measure_time decorator to track execution time of key functions
- Remove multi-threading in pdf parsing for better resource management
- Optimize pdf parsing logic for improved performance

e516cf53
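
A timing decorator of the kind this commit describes might look like the sketch below. The names `measure_time` and `PerformanceStats.print_stats` appear in the commit messages, but the bodies here are assumptions written for illustration; the `add` and `summary` helpers are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps


class PerformanceStats:
    """Collects per-function execution times (illustrative sketch only)."""

    _times = defaultdict(list)

    @classmethod
    def add(cls, name, seconds):
        cls._times[name].append(seconds)

    @classmethod
    def summary(cls):
        # Call count and total elapsed seconds per function name.
        return {
            name: {"calls": len(values), "total": sum(values)}
            for name, values in cls._times.items()
        }

    @classmethod
    def print_stats(cls):
        for name, stat in cls.summary().items():
            print(f"{name}: {stat['calls']} calls, {stat['total']:.6f}s total")


def measure_time(func):
    """Record how long each call to `func` takes, even if it raises."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            PerformanceStats.add(func.__name__, time.perf_counter() - start)

    return wrapper
```

Decorating a function with `@measure_time` and calling `PerformanceStats.print_stats()` at the end of a run prints one line per measured function.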

28 Feb, 2025 1 commit

feat(pdf_parse): implement multi-threaded page processing · 6ec440d6

myhloli authored Feb 28, 2025

- Add ThreadPoolExecutor to process PDF pages in parallel
- Create separate function for page processing to improve readability and maintainability
- Include error handling for individual page processing tasks
- Log total page processing time for performance monitoring

6ec440d6
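
The pattern listed in this commit (a thread pool, a separate per-page function, per-page error handling, and a total-time log line) can be sketched with the standard library as below. `process_pages` and its parameters are hypothetical names, and returning `None` for a failed page is an assumption about how failures might be represented.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def process_pages(pages, process_page, max_workers=4):
    """Run `process_page` on every page in parallel; return results in page order."""
    start = time.perf_counter()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which page index each future belongs to.
        futures = {
            executor.submit(process_page, page): index
            for index, page in enumerate(pages)
        }
        for future in as_completed(futures):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception:
                # One failing page should not abort the whole document.
                logger.exception("page %d failed", index)
                results[index] = None
    logger.info(
        "processed %d pages in %.3fs", len(pages), time.perf_counter() - start
    )
    return [results[index] for index in range(len(pages))]
```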

27 Feb, 2025 2 commits
- refactor(ocr_mkcontent): optimize full-width character handling · df1b8f59
  myhloli authored Feb 27, 2025
```
- Update condition to only convert full-width letters and numbers
- Remove separate case for full-width space
```
  df1b8f59
- Update version.py with new version · d64182ea
  myhloli authored Feb 27, 2025
  
  d64182ea
26 Feb, 2025 3 commits

fix: match multiple captions · 15cd97ff
icecraft authored Feb 26, 2025

15cd97ff

refactor(magic_pdf): simplify device selection in model initialization · 0a246f0f

myhloli authored Feb 26, 2025

- Replace complex device selection logic with a single line using torch.device
- Remove redundant checks and imports for better readability and maintainability

0a246f0f

refactor(magic_pdf): remove bfloat16 support checks and usage · 9b00f988

myhloli authored Feb 26, 2025

- Remove supports_bfloat16 variable and related checks
- Remove model.bfloat16() call for LayoutLMv3ForTokenClassification
- Simplify device selection logic

9b00f988

25 Feb, 2025 2 commits

feat(ocr_mkcontent): add full-width to half-width character conversion · 315adbce

myhloli authored Feb 25, 2025

- Implement full_to_half function to convert full-width characters to half-width
- Apply conversion to span content before merging paragraphs
- Improve text processing for better readability and consistency

315adbce
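
For reference, full-width ASCII variants occupy U+FF01 to U+FF5E and sit exactly 0xFEE0 above their half-width counterparts, so a conversion can be sketched as below. The function name `full_to_half` comes from the commit message, but this body is an assumption, not the project's implementation. The `letters_and_digits_only` flag is a hypothetical parameter mirroring the narrower behaviour described in the 27 Feb refactor, and the full-width space (U+3000) is deliberately left untouched here.

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01..U+FF5E map onto U+0021..U+007E


def full_to_half(text, letters_and_digits_only=False):
    """Convert full-width ASCII variants in `text` to half-width characters."""
    converted = []
    for char in text:
        code = ord(char)
        if 0xFF01 <= code <= 0xFF5E:
            half = chr(code - FULLWIDTH_OFFSET)
            # Optionally keep full-width punctuation and convert only A-Z, a-z, 0-9.
            if letters_and_digits_only and not half.isalnum():
                converted.append(char)
            else:
                converted.append(half)
        else:
            converted.append(char)
    return "".join(converted)
```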

perf(model): optimize batch analyze process · 6753df8d

myhloli authored Feb 25, 2025

- Move batch model initialization outside the loop
- Collect page dimensions before analyzing
- Update page info dictionary structure
- Add null dimensions for non-analyzed pages

6753df8d

24 Feb, 2025 3 commits

feat(pre_proc): add block type compatibility check for span allocation · 19916856

myhloli authored Feb 24, 2025

- Introduce span_block_type_compatible function to check compatibility between span and block types
- Update fill_spans_in_blocks function to use the new compatibility check
- Improve accuracy of span allocation to blocks based on content type

19916856

fix(llm_aided): update prompt · 9e332f06
myhloli authored Feb 24, 2025

9e332f06

fix(magic_pdf): correct negative indexing for `end_page_id` · 90a27ecd

myhloli authored Feb 24, 2025

- Update the logic for determining `end_page_id` to handle negative values
- This change ensures proper behavior when `end_page_id` is set to -1 or other negative values

90a27ecd
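
One way to resolve a possibly negative `end_page_id` is sketched below. It assumes Python-style semantics, where -1 means the last page, and clamps out-of-range values to the valid page range. The helper name `resolve_end_page_id` and the handling of `None` are assumptions, not the project's code.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Turn a possibly negative or missing end page index into a valid one."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    last_page = page_count - 1
    if end_page_id is None:
        return last_page
    if end_page_id < 0:
        # -1 refers to the last page, -2 to the one before it, and so on.
        end_page_id += page_count
    # Clamp values that still fall outside the document.
    return min(max(end_page_id, 0), last_page)
```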

23 Feb, 2025 1 commit

chore(magic_pdf): enhance license logging information · 3fe315d8

myhloli authored Feb 23, 2025

- Add license ID information to the log for better traceability
- Improve logging format to include both license ID and expiration date

3fe315d8

22 Feb, 2025 1 commit
- fix doc_analyze first page only · 37f3e200
  Nathan Dahlberg authored Feb 22, 2025
  
  37f3e200
21 Feb, 2025 3 commits

fix(model): handle import errors and improve exception logging · 66f0899a

myhloli authored Feb 21, 2025

- Add ImportError handling to silence known import-related exceptions
- Improve generic exception handling to log error messages
- Maintain existing specific exception handlers for license-related issues

66f0899a

feat(model_init): implement license verification for Ascend plugin · d5f6fbc6

myhloli authored Feb 21, 2025

- Add license verification logic for Ascend plugin
- Handle different license-related exceptions with appropriate error messages
- Log success message with license expiration date if verification passes
- Fall back to CPU model if license verification fails or plugin is not available

d5f6fbc6

refactor(magic_pdf): improve title optimization process · 54940c61

myhloli authored Feb 21, 2025

- Update instructions for AI-generated titles optimization
- Use ast.literal_eval() instead of json.loads() for parsing completion content
- Refactor variable names and logging for better code readability
- Add error handling for JSON decoding issues

54940c61
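
The practical difference between the two parsers is that `ast.literal_eval()` accepts Python literals such as single-quoted strings, which `json.loads()` rejects, while it does not accept JSON-only tokens such as `true` or `null`. The sketch below shows that trade-off; `parse_completion` is a hypothetical helper, and returning `None` on failure is an assumption.

```python
import ast
import json


def parse_completion(content):
    """Parse model output as a Python literal; return None if it is not one."""
    try:
        return ast.literal_eval(content.strip())
    except (ValueError, SyntaxError):
        return None


def parses_as_json(content):
    """Report whether `content` is valid JSON, for comparison."""
    try:
        json.loads(content)
        return True
    except json.JSONDecodeError:
        return False
```

A completion such as `{'a': 1}` parses with `parse_completion` but is not valid JSON, whereas `true` is valid JSON but is not a Python literal.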

18 Feb, 2025 3 commits
- fix: update figure caption match algorithm · f731fcab
  icecraft authored Feb 18, 2025
  
  f731fcab
- fix: update figure caption match algorithm · 0793da41
  icecraft authored Feb 18, 2025
  
  0793da41
- fix: caption match algorithm · daf0593b
  icecraft authored Feb 18, 2025
  
  daf0593b
14 Feb, 2025 1 commit
- fix(pdf_parse): Fixed the issue where some headings were missing in certain complex layouts. · 30bd3a83
  myhloli authored Feb 14, 2025
  
  30bd3a83
11 Feb, 2025 2 commits

fix(model): move environment variable settings to global scope · f5112e21

myhloli authored Feb 11, 2025

- Move environment variable settings for NPU, MPS, and other configurations to the global scope in doc_analyze_by_custom_model.py
- Remove redundant environment variable settings in pdf_extract_kit.py
- This change ensures consistent configuration across the application and avoids potential conflicts or duplicate settings

f5112e21

refactor(magic_pdf): improve code structure and memory safety · 4021abeb
myhloli authored Feb 11, 2025

4021abeb

10 Feb, 2025 2 commits

refactor(model_init): adjust table model import order and remove redundant imports · 4c0af020

myhloli authored Feb 10, 2025

- Remove redundant imports for StructTableModel and TableMasterPaddleModel
- Reorder imports to group related modules together
- Update import structure for better readability and maintainability

4c0af020

refactor(model): integrate Ascend plugin for NPU support · 7c76d361

myhloli authored Feb 10, 2025

- Remove unused utility functions
- Update import statements for better readability
- Add conditional imports for Ascend plugin
- Refactor table model initialization to support NPU

7c76d361

09 Feb, 2025 4 commits
- fix(pdf_parse): improve image processing and OCR accuracy · 5561ac95
  myhloli authored Feb 09, 2025
```
- Update calculate_contrast function to support both RGB and BGR image modes
- Add input validation for image mode in calculate_contrast function
- Modify usage of calculate_contrast function in OCR processing to specify image mode
```
  5561ac95
- perf(language_detection): optimize batch size for language detection model · e4e4eef1
  myhloli authored Feb 09, 2025
```
- Increase batch size from 8 to 256 for language detection inference
- Add timing measurement for language detection process
```
  e4e4eef1
- fix(filter): toggle invalid character detection method · a5342950
  myhloli authored Feb 09, 2025
  
  a5342950
- refactor(filter): remove unused text layout analysis for PDF classification · f35a6c08
  myhloli authored Feb 09, 2025
  
  f35a6c08
08 Feb, 2025 1 commit

feat(pdf_parse): improve OCR processing and contrast filtering · 9f18ca20

myhloli authored Feb 08, 2025

- Rename empty_spans to need_ocr_spans for better clarity
- Add calculate_contrast function to measure image contrast
- Filter out low-contrast spans to improve OCR accuracy
- Update OCR processing workflow to use new filtering method

9f18ca20
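
A contrast measure of the kind named in this commit could be sketched as the standard deviation of luminance divided by its mean. The commit messages do not give the formula, so that definition is an assumption, as are the BT.601 luma weights and the pure-Python pixel representation. The `img_mode` argument mirrors the RGB/BGR support and input validation mentioned in the 09 Feb fix.

```python
import math

LUMA_WEIGHTS = (0.299, 0.587, 0.114)  # ITU-R BT.601 weights for R, G, B


def calculate_contrast(pixels, img_mode="rgb"):
    """Return std/mean of luminance for a flat list of 3-channel pixels."""
    if img_mode not in ("rgb", "bgr"):
        raise ValueError(f"unsupported image mode: {img_mode!r}")
    if not pixels:
        return 0.0
    luma = []
    for pixel in pixels:
        # BGR pixels are reversed so the weights always apply to R, G, B.
        r, g, b = pixel if img_mode == "rgb" else pixel[::-1]
        luma.append(
            LUMA_WEIGHTS[0] * r + LUMA_WEIGHTS[1] * g + LUMA_WEIGHTS[2] * b
        )
    mean = sum(luma) / len(luma)
    if mean == 0:
        return 0.0
    variance = sum((value - mean) ** 2 for value in luma) / len(luma)
    return math.sqrt(variance) / mean
```

Spans whose score falls below a chosen threshold could then be skipped before OCR; the threshold value itself is not stated in the commit messages.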