Commits · 104273cc79ecc921020e962238014e2ce2dad9ba · wangsen / MinerU

03 Dec, 2024 2 commits

fix(vram): improve VRAM checking logic · 104273cc

myhloli authored Dec 03, 2024

- Update VRAM checking logic in app.py and model_utils.py
- Add None and type checks for VRAM values
- Adjust concurrency limit calculation in app.py
- Modify clean_vram function to handle cases with no VRAM information

104273cc

feat(gradio_app): implement dynamic concurrency limit based on VRAM · b1fe9d4f

myhloli authored Dec 03, 2024

- Add get_concurrency_limit function to calculate concurrency limit based on VRAM
- Update clean_vram function and rename to get_vram for better clarity
- Apply concurrency limit to the to_markdown function in the Gradio app

b1fe9d4f
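
The two VRAM commits above describe deriving a concurrency limit from available VRAM and guarding against missing or malformed values. A minimal sketch of that idea follows. Only the name `get_concurrency_limit` comes from the commit message; the signature, the `gb_per_worker` ratio, and the `fallback` value are assumptions made for illustration and are not taken from the repository.

```python
import math
from typing import Optional, Union


def get_concurrency_limit(vram_gb: Optional[Union[int, float]],
                          gb_per_worker: float = 8.0,
                          fallback: int = 1) -> int:
    """Hypothetical policy: one worker per `gb_per_worker` GB of VRAM."""
    # None means VRAM could not be determined (for example, no GPU present).
    # bool is rejected explicitly because it is a subclass of int.
    if vram_gb is None or isinstance(vram_gb, bool):
        return fallback
    # Type check: anything that is not a number falls back to the default.
    if not isinstance(vram_gb, (int, float)):
        return fallback
    # Reject NaN, infinities, zero, and negative readings.
    if not math.isfinite(vram_gb) or vram_gb <= 0:
        return fallback
    return max(fallback, int(vram_gb // gb_per_worker))


if __name__ == "__main__":
    for value in (None, 4, 24, "24"):
        print(repr(value), "->", get_concurrency_limit(value))
```

The value returned here could then be supplied wherever the app configures how many conversions run at once.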

02 Dec, 2024 3 commits

Update version.py with new version · b9f3435c
myhloli authored Dec 02, 2024

b9f3435c

fix: reduce maximum image size · b0529b6f

myhloli authored Dec 02, 2024

- Decrease the maximum width and height from 9000 to 4500 pixels
- This change aims to prevent excessive resource usage when rendering PDFs

b0529b6f

fix(pre_proc): prevent errors when imageWriter is None · 7f8dc353

myhloli authored Dec 02, 2024

- Updated cut_image.py to check for NoneType imageWriter
- Prevents AttributeError when imageWriter is not provided

  7f8dc353
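
Commit b0529b6f above lowers the maximum rendered width and height from 9000 to 4500 pixels. One way such a cap could be enforced is sketched below: compute the usual DPI-based scale, then shrink it when the longest side would exceed the limit. The function name `render_scale` and the proportional-shrink strategy are assumptions; the project may handle oversized pages differently (for example, by falling back to a fixed scale).

```python
MAX_SIDE_PX = 4500  # limit named in commit b0529b6f (previously 9000)


def render_scale(width_pt: float, height_pt: float, dpi: int = 200) -> float:
    """Return a zoom factor that keeps both rendered sides <= MAX_SIDE_PX."""
    scale = dpi / 72.0  # PDF user space uses 72 points per inch
    longest = max(width_pt, height_pt) * scale
    if longest > MAX_SIDE_PX:
        # Shrink proportionally so the longest side lands on the limit.
        scale *= MAX_SIDE_PX / longest
    return scale


if __name__ == "__main__":
    print(render_scale(612, 792))    # US Letter page: stays at dpi / 72
    print(render_scale(3000, 1000))  # oversized page: scale is reduced
```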
30 Nov, 2024 1 commit

refactor(para): adjust line height multiplier for block splitting · 41545a13

myhloli authored Dec 01, 2024

- Decrease the line height multiplier from 0.8 to 0.7 for both left and right sides
- This modification aims to improve the accuracy of paragraph splitting

41545a13

29 Nov, 2024 8 commits

Update version.py with new version · f8828be7
myhloli authored Nov 29, 2024

f8828be7

refactor: modify bbox processing for layout separation · b3127233

myhloli authored Nov 30, 2024

- Remove overlap between bboxes for block separation
- Sort bboxes by combined x and y coordinates for better layout handling
- Comment out previous overlap removal function

b3127233

refactor(mkcontent): optimize paragraph text merging and language detection · b80befe9

myhloli authored Nov 30, 2024

- Extract language detection to block level instead of line level
- Improve logic for handling Chinese, Japanese, and Korean languages
- Refactor code for better readability and performance
- Optimize handling of hyphenated words at line ends

b80befe9
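
The commit above moves language detection to the block level and treats Chinese, Japanese, and Korean text differently from Western text when merging lines, including words hyphenated at a line end. A rough sketch of that kind of merge follows. The function name `merge_lines`, the language codes, and the exact hyphen rule are assumptions for illustration, not the project's code.

```python
from typing import Iterable

CJK_LANGS = {"zh", "ja", "ko"}  # assumed language codes


def merge_lines(lines: Iterable[str], block_lang: str) -> str:
    """Join the lines of one block using a single block-level language."""
    texts = [line.strip() for line in lines if line.strip()]
    if block_lang in CJK_LANGS:
        # CJK scripts do not separate words with spaces.
        return "".join(texts)
    merged = ""
    for text in texts:
        hyphenated = (
            len(merged) >= 2
            and merged.endswith("-")
            and merged[-2].isalpha()
            and text[:1].islower()
        )
        if hyphenated:
            # Word split across lines: drop the hyphen and glue the halves.
            merged = merged[:-1] + text
        elif merged:
            merged += " " + text
        else:
            merged = text
    return merged


if __name__ == "__main__":
    print(merge_lines(["An exam-", "ple of text"], "en"))
    print(merge_lines(["这是第一行", "这是第二行"], "zh"))
```

Detecting the language once per block, as the commit describes, avoids repeating the detection for every line and keeps the spacing rule consistent within a paragraph.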

feat(ocr_mkcontent): add language detection for line spacing · c8cabb3c

myhloli authored Nov 30, 2024

- Introduce language detection to determine line spacing based on language context
- Implement different spacing rules for Chinese/Japanese/Korean and Western texts
- Adjust span content handling based on detected language and span type

c8cabb3c

Update version.py with new version · d19911f1
myhloli authored Nov 29, 2024

d19911f1
refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment · 7f2f2c0f
myhloli authored Nov 29, 2024

7f2f2c0f

refactor(pdf_parse): adjust character-axis alignment algorithm · d4345b6e

myhloli authored Nov 29, 2024

- Introduce `span_height_radio` parameter to calculate_char_in_span function
- Replace fixed ratio with dynamic ratio for character and span axis alignment
- Improve flexibility and accuracy of character placement within spans

d4345b6e
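
The commit above replaces a fixed ratio with a `span_height_radio` parameter when deciding whether a character lines up with a span. The sketch below shows one plausible form of such a test: the character's center must fall inside the span, and its vertical offset from the span's center line must stay below a fraction of the span height. The function name `char_in_span`, the bbox layout `(x0, y0, x1, y1)`, and the default of 0.33 are assumptions.

```python
from typing import Sequence


def char_in_span(char_bbox: Sequence[float],
                 span_bbox: Sequence[float],
                 span_height_radio: float = 0.33) -> bool:
    """Check whether a character is axis-aligned with a span."""
    char_cx = (char_bbox[0] + char_bbox[2]) / 2
    char_cy = (char_bbox[1] + char_bbox[3]) / 2
    span_cy = (span_bbox[1] + span_bbox[3]) / 2
    span_height = span_bbox[3] - span_bbox[1]
    # The character center must lie inside the span rectangle.
    inside = (span_bbox[0] < char_cx < span_bbox[2]
              and span_bbox[1] < char_cy < span_bbox[3])
    # The vertical offset is limited to a tunable share of the span height.
    aligned = abs(char_cy - span_cy) < span_height * span_height_radio
    return inside and aligned


if __name__ == "__main__":
    span = (0, 0, 100, 12)
    print(char_in_span((10, 2, 16, 10), span))        # centered character
    print(char_in_span((10, 8, 16, 12), span))        # sits low in the span
    print(char_in_span((10, 8, 16, 12), span, 0.5))   # looser ratio
```

A larger ratio accepts characters that sit further from the span's center line, which may help with subscripts or tall spans at the cost of more false matches.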

fix(ocr_mkcontent): handle empty paragraphs on pages · 782e6571

myhloli authored Nov 29, 2024

- Add empty paragraph handling for pages with no content
- Append an empty markdown object when a page has no paragraphs
- Increment page number even if no content is present

782e6571
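
The commit above appends an empty markdown object for pages without paragraphs so that page numbering stays in step with the source document. A minimal sketch of that behavior, with the function name `pages_to_markdown` and the keys `page_no` and `md_content` chosen for illustration only:

```python
from typing import Dict, List, Union


def pages_to_markdown(pages: List[List[str]]) -> List[Dict[str, Union[int, str]]]:
    """Emit one entry per page, including pages that have no paragraphs."""
    output = []
    for page_no, paragraphs in enumerate(pages):
        if not paragraphs:
            # Keep a placeholder so later pages retain their page number.
            output.append({"page_no": page_no, "md_content": ""})
            continue
        output.append({"page_no": page_no,
                       "md_content": "\n\n".join(paragraphs)})
    return output


if __name__ == "__main__":
    for entry in pages_to_markdown([["Intro", "Body"], [], ["Appendix"]]):
        print(entry)
```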

28 Nov, 2024 7 commits

feat(pdf_parse): add line start flag detection and optimize line stop flag logic · 949d0867

myhloli authored Nov 28, 2024

- Add LINE_START_FLAG tuple to identify starting flags of a line
- Modify calculate_char_in_span function to handle both line start and stop flags
- Remove redundant char_is_line_stop_flag variable and simplify logic
- Improve line flag detection to enhance text extraction accuracy

949d0867
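
The commit above adds a `LINE_START_FLAG` tuple alongside the existing line stop flags. The motivation, as far as the message suggests, is that punctuation at either end of a line can sit partly outside the span's box and so fail a plain center-inside test. The sketch below shows one way an edge test for such characters might look. The tuple contents are a shortened illustrative subset, and `edge_flag_in_span` is a hypothetical helper, not the project's `calculate_char_in_span`.

```python
from typing import Sequence

# Illustrative subsets only; the project's tuples may differ.
LINE_STOP_FLAG = (".", "!", "?", ",", ":", ";", ")", "。", "！", "？", "，", "）")
LINE_START_FLAG = ("(", "[", "{", "（", "【", "《", "“")


def edge_flag_in_span(char: str,
                      char_bbox: Sequence[float],
                      span_bbox: Sequence[float]) -> bool:
    """Accept edge punctuation that touches the start or end of a span."""
    span_height = span_bbox[3] - span_bbox[1]
    char_cy = (char_bbox[1] + char_bbox[3]) / 2
    if not span_bbox[1] < char_cy < span_bbox[3]:
        return False
    if char in LINE_STOP_FLAG:
        # Left edge of the character lies within one span height of the span's right edge.
        return span_bbox[2] - span_height < char_bbox[0] < span_bbox[2]
    if char in LINE_START_FLAG:
        # Right edge of the character lies within one span height of the span's left edge.
        return span_bbox[0] < char_bbox[2] < span_bbox[0] + span_height
    return False


if __name__ == "__main__":
    span = (0, 0, 100, 10)
    print(edge_flag_in_span(".", (98, 0, 103, 10), span))
    print(edge_flag_in_span("(", (-3, 0, 2, 10), span))
```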

refactor(pdf_check): improve character detection using PyMuPDF · ac888156

myhloli authored Nov 28, 2024

- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt

ac888156
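
The commit above detects invalid characters with PyMuPDF and adds a `__replace_0xfffd` helper for special characters. The sketch below works on already-extracted text, so it needs no PDF library: it measures the share of U+FFFD replacement characters and substitutes them. The names `has_too_many_invalid_chars` and `replace_0xfffd`, the 5% threshold, and the choice of a space as the substitute are assumptions.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character


def has_too_many_invalid_chars(text: str, max_ratio: float = 0.05) -> bool:
    """Flag text whose non-whitespace characters are mostly undecodable."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    invalid = sum(1 for ch in visible if ch == REPLACEMENT_CHAR)
    return invalid / len(visible) > max_ratio


def replace_0xfffd(text: str, substitute: str = " ") -> str:
    """Swap every U+FFFD for a harmless substitute."""
    return text.replace(REPLACEMENT_CHAR, substitute)


if __name__ == "__main__":
    sample = "abc\ufffd\ufffd"
    print(has_too_many_invalid_chars(sample))
    print(repr(replace_0xfffd(sample)))
```

A document flagged this way could be routed to OCR instead of direct text extraction.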

refactor(ocr): improve text processing and span handling · 88c0854a

myhloli authored Nov 28, 2024

- Remove unused language detection code
- Simplify text content processing logic
- Update span sorting and text extraction in pdf_parse_union_core_v2.py

88c0854a

feat(pdf_parse): filter out skewed text lines · 37da8c44

myhloli authored Nov 28, 2024

- Add direction filtering to ignore highly skewed text lines
- Improve text extraction accuracy by focusing on non-skewed content

37da8c44
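
The commit above filters out highly skewed text lines by direction. In PyMuPDF's `dict` text output, each line carries a `dir` entry holding its writing direction as a `(cos, sin)` unit vector, so a filter of this kind can be sketched on plain dictionaries. The helper names and the tolerance below are assumptions, and keeping vertical lines as well as horizontal ones is a design choice made for this example.

```python
from typing import Dict, List, Tuple


def is_axis_aligned(direction: Tuple[float, float], tol: float = 1e-3) -> bool:
    """True when the direction vector is horizontal or vertical within `tol`."""
    dx, dy = direction
    horizontal = abs(abs(dx) - 1.0) < tol and abs(dy) < tol
    vertical = abs(dx) < tol and abs(abs(dy) - 1.0) < tol
    return horizontal or vertical


def drop_skewed_lines(lines: List[Dict]) -> List[Dict]:
    """Keep only lines whose `dir` vector is axis-aligned."""
    return [line for line in lines if is_axis_aligned(line["dir"])]


if __name__ == "__main__":
    lines = [
        {"text": "normal", "dir": (1.0, 0.0)},
        {"text": "rotated 90", "dir": (0.0, -1.0)},
        {"text": "watermark", "dir": (0.7071, 0.7071)},
    ]
    print([line["text"] for line in drop_skewed_lines(lines)])
```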

refactor(para): improve language detection and block splitting · f674b8d4

myhloli authored Nov 28, 2024

- Add language detection for each block of text
- Implement language-specific logic for right margin alignment
- Introduce logging for debugging purposes

f674b8d4

fix(Hybrid OCR):Enable Hybrid OCR for Empty Spans That Contain a Certain... · 08392d63
myhloli authored Nov 28, 2024

fix(Hybrid OCR):Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text

08392d63
fix(lite_model): Adapt lite Mode to the Hybrid OCR Mode in Version 0.10 · 9b4d77dc
myhloli authored Nov 28, 2024

9b4d77dc

27 Nov, 2024 5 commits

Update version.py with new version · 52ef1bc7
myhloli authored Nov 27, 2024

52ef1bc7
refactor(pdf_parse_union_core_v2): optimize page processing time logging · 1d2eb70a
myhloli authored Nov 27, 2024

1d2eb70a

refactor(ocr): remove unused functions and optimize OCR processing loop · 5f4410b4

myhloli authored Nov 27, 2024

- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops

5f4410b4

refactor(pre_proc): clean up OCR processing code · a46b12e9

myhloli authored Nov 27, 2024

- Remove commented-out code in ocr_dict_merge.py
- Improve imports and code organization in ocr_detect_all_bboxes.py
- Delete unnecessary empty lines and improve code readability

a46b12e9

refactor(libs): remove unused imports and functions · 2db3c263

myhloli authored Nov 27, 2024

- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability

2db3c263

26 Nov, 2024 8 commits

perf(image_processing): reduce maximum image size for analysis · b3644157

myhloli authored Nov 26, 2024

- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process

b3644157

refactor: remove deprecated markdown_utils function · ce202d92
myhloli authored Nov 26, 2024

ce202d92

refactor(pre_proc): remove unused functions and simplify code · 21fa7819

myhloli authored Nov 26, 2024

- Remove unused imports and functions across multiple files
- Simplify code by deleting unnecessary comments and empty lines
- Update function signatures to match actual usage
- Replace redundant code with more efficient alternatives

21fa7819

refactor(magic_pdf): remove unused functions and simplify code · 6a22b5ab
myhloli authored Nov 26, 2024

6a22b5ab

refactor(magic_pdf): remove unused functions and simplify code · ecdaa49a
myhloli authored Nov 26, 2024

ecdaa49a

feat(pdf_parse): improve text extraction for vertical spans · 81635062

myhloli authored Nov 26, 2024

- Calculate median span height to identify vertical spans
- Use PyMuPDF's 'dict' output to fill vertical spans with lines

81635062

feat(pdf_parse): add OCR score to span data · 7d4dfca2

myhloli authored Nov 26, 2024

- Add OCR score to span dictionary when OCR text is applied
- Improve data integrity by including confidence score

7d4dfca2

feat(ocr): filter out low confidence ocr results · eb45a0e8

myhloli authored Nov 26, 2024

- Add confidence score threshold to filter out low confidence OCR results
- Improve OCR accuracy by ignoring less certain detections

  eb45a0e8
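
The two OCR commits above (7d4dfca2 and eb45a0e8) store a confidence score on each span and discard low-confidence results. A compact sketch of both steps together follows. The function name `apply_ocr_result`, the span keys `content` and `score`, and the 0.5 threshold are assumptions for illustration.

```python
from typing import Dict


def apply_ocr_result(span: Dict, text: str, score: float,
                     min_score: float = 0.5) -> bool:
    """Write OCR text and its score into `span` unless confidence is too low."""
    if score < min_score:
        # Low-confidence detection: leave the span untouched.
        return False
    span["content"] = text
    span["score"] = round(score, 2)  # keep the confidence next to the text
    return True


if __name__ == "__main__":
    good, bad = {}, {}
    print(apply_ocr_result(good, "Invoice total", 0.91), good)
    print(apply_ocr_result(bad, "l|1I", 0.2), bad)
```

Keeping the score on the span lets later stages apply their own, stricter thresholds without re-running OCR.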

25 Nov, 2024 6 commits

refactor(para): improve block merging logic in para_split_v3.py · 160624bd

myhloli authored Nov 25, 2024

- Add checks for uppercase character start in the first span of a block

160624bd

refactor(pdf_parse): improve text content extraction from PDF spans · 14656085

myhloli authored Nov 25, 2024

- Optimize character sorting for accurate text assembly
- Handle empty char scenarios to prevent errors
- Remove unnecessary comments and improve code readability
- Enhance OCR text content handling by removing low-confidence spans

14656085

refactor(pdf_parse): improve code readability and maintainability · 7964ae45
myhloli authored Nov 25, 2024

7964ae45

refactor(pdf_parse): improve code readability and maintainability · 97bcc8b2
myhloli authored Nov 25, 2024

97bcc8b2

refactor(txt_spans_extract_v2): optimize span processing and OCR logic · 034c59a8

myhloli authored Nov 25, 2024

- Merge useful_spans and unuseful_spans handling
- Simplify overlap ratio calculation and block type checking
- Remove unnecessary span removal and re-addition

034c59a8

fix(pdf_parse): Move the logic for filling text content into spans before the... · 0d3ef89f

myhloli authored Nov 25, 2024

fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.

0d3ef89f