Commits · 52ef1bc782ddf8fac0fae519fd61425eea3e5786 · wangsen / MinerU

27 Nov, 2024 5 commits

Update version.py with new version · 52ef1bc7
myhloli authored Nov 27, 2024

52ef1bc7

refactor(pdf_parse_union_core_v2): optimize page processing time logging · 1d2eb70a
myhloli authored Nov 27, 2024

1d2eb70a

refactor(ocr): remove unused functions and optimize OCR processing loop · 5f4410b4

myhloli authored Nov 27, 2024

- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops

5f4410b4

refactor(pre_proc): clean up OCR processing code · a46b12e9

myhloli authored Nov 27, 2024

- Remove commented-out code in ocr_dict_merge.py
- Improve imports and code organization in ocr_detect_all_bboxes.py
- Delete unnecessary empty lines and improve code readability

a46b12e9

refactor(libs): remove unused imports and functions · 2db3c263

myhloli authored Nov 27, 2024

- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability

2db3c263

26 Nov, 2024 8 commits
- perf(image_processing): reduce maximum image size for analysis · b3644157
  myhloli authored Nov 26, 2024
```
- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process
```
  b3644157
- refactor: remove deprecated markdown_utils function · ce202d92
  myhloli authored Nov 26, 2024
  
  ce202d92
- refactor(pre_proc): remove unused functions and simplify code · 21fa7819
  myhloli authored Nov 26, 2024
```
- Remove unused imports and functions across multiple files
- Simplify code by deleting unnecessary comments and empty lines
- Update function signatures to match actual usage
- Replace redundant code with more efficient alternatives
```
  21fa7819
- refactor(magic_pdf): remove unused functions and simplify code · 6a22b5ab
  myhloli authored Nov 26, 2024
  
  6a22b5ab
- refactor(magic_pdf): remove unused functions and simplify code · ecdaa49a
  myhloli authored Nov 26, 2024
  
  ecdaa49a
- feat(pdf_parse): improve text extraction for vertical spans · 81635062
  myhloli authored Nov 26, 2024
```
- Calculate median span height to identify vertical spans
- Use PyMuPDF's 'dict' output to fill vertical spans with lines
```
  81635062
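
The median-height idea in 81635062 can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes each span is a dict with a `bbox` of `(x0, y0, x1, y1)`, and both the helper name `find_vertical_spans` and the `ratio` cutoff are hypothetical, since the commit message does not state the exact criterion.

```python
from statistics import median
from typing import Dict, List


def find_vertical_spans(spans: List[Dict], ratio: float = 3.0) -> List[Dict]:
    """Return spans that are much taller than the median span and taller than wide.

    Assumption: span["bbox"] is (x0, y0, x1, y1); `ratio` is an illustrative cutoff.
    """
    if not spans:
        return []
    heights = [s["bbox"][3] - s["bbox"][1] for s in spans]
    median_height = median(heights)
    vertical = []
    for span, height in zip(spans, heights):
        width = span["bbox"][2] - span["bbox"][0]
        # A vertical span stands out from typical line height and is taller than wide.
        if height > ratio * median_height and height > ratio * width:
            vertical.append(span)
    return vertical
```

Spans flagged this way would then be the candidates for refilling from PyMuPDF's 'dict' output, as the second bullet of the commit describes.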
- feat(pdf_parse): add OCR score to span data · 7d4dfca2
  myhloli authored Nov 26, 2024
```
- Add OCR score to span dictionary when OCR text is applied
- Improve data integrity by including confidence score
```
  7d4dfca2
- feat(ocr): filter out low confidence ocr results · eb45a0e8
  myhloli authored Nov 26, 2024
```
- Add confidence score threshold to filter out low confidence OCR results
- Improve OCR accuracy by ignoring less certain detections
```
  eb45a0e8
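
A minimal sketch of the confidence filter described in eb45a0e8. It assumes each OCR result is a `(text, score)` pair and uses 0.5 as the cutoff; the commit message states neither the data layout nor the threshold value, so `filter_ocr_results` and `min_confidence` are hypothetical names.

```python
from typing import List, Tuple

# (recognized text, confidence in [0, 1]); this layout is an assumption.
OcrResult = Tuple[str, float]


def filter_ocr_results(results: List[OcrResult], min_confidence: float = 0.5) -> List[OcrResult]:
    """Keep only results whose confidence is at least `min_confidence`."""
    return [(text, score) for text, score in results if score >= min_confidence]
```

Making the threshold a parameter rather than a constant keeps the trade-off between recall and precision adjustable per document type.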
25 Nov, 2024 7 commits
- refactor(para): improve block merging logic in para_split_v3.py · 160624bd
  myhloli authored Nov 25, 2024
```
- Add checks for uppercase character start in the first span of a block
```
  160624bd
- refactor(pdf_parse): improve text content extraction from PDF spans · 14656085
  myhloli authored Nov 25, 2024
```
- Optimize character sorting for accurate text assembly
- Handle empty char scenarios to prevent errors
- Remove unnecessary comments and improve code readability
- Enhance OCR text content handling by removing low-confidence spans
```
  14656085
- refactor(pdf_parse): improve code readability and maintainability · 7964ae45
  myhloli authored Nov 25, 2024
  
  7964ae45
- refactor(pdf_parse): improve code readability and maintainability · 97bcc8b2
  myhloli authored Nov 25, 2024
  
  97bcc8b2
- refactor(txt_spans_extract_v2): optimize span processing and OCR logic · 034c59a8
  myhloli authored Nov 25, 2024
```
- Merge useful_spans and unuseful_spans handling
- Simplify overlap ratio calculation and block type checking
- Remove unnecessary span removal and re-addition
```
  034c59a8
- fix(pdf_parse): Move the logic for filling text content into spans before the... · 0d3ef89f
  myhloli authored Nov 25, 2024
```
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
```
  0d3ef89f
- Update version.py with new version · 9d6be7c9
  myhloli authored Nov 25, 2024
  
  9d6be7c9
24 Nov, 2024 2 commits
- fix: remove unused file · e9ace3eb
  icecraft authored Nov 24, 2024
  
  e9ace3eb
- fix: rewrite projects/ and demos with new data api · b1adde8e
  icecraft authored Nov 24, 2024
  
  b1adde8e
22 Nov, 2024 5 commits

Update version.py with new version · 0624b565
myhloli authored Nov 22, 2024

0624b565

fix(pdf_parse): improve OCR result handling · 6b296ee2

myhloli authored Nov 22, 2024

- Add null check for OCR results to prevent errors on empty lists
- Enhance robustness of OCR text processing in the magic-pdf project

6b296ee2
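
The guard described in 6b296ee2 might look like the sketch below. The function name `first_ocr_text` and the `(text, score)` result layout are assumptions made for illustration; the commit message only says that a null check was added so that empty results no longer raise errors.

```python
from typing import List, Optional, Tuple


def first_ocr_text(ocr_result: Optional[List[Tuple[str, float]]]) -> str:
    """Return the text of the first OCR result, or "" when there is none."""
    if not ocr_result:  # covers both None and an empty list
        return ""
    text, _score = ocr_result[0]
    return text
```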

refactor(model): move page total time logging to custom model analysis · f1e2f084

myhloli authored Nov 22, 2024

- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time

f1e2f084

fix(table): add null check for OCR result in rapid table prediction · 18aa1a20

myhloli authored Nov 22, 2024

- Add a null check for OCR result in the predict method
- Return None values if OCR result is None to prevent further processing

18aa1a20

refactor(para): improve line stop flag and remove unused debug mode · 5d6cbcb1

myhloli authored Nov 22, 2024

- Add '-' and '–' to LINE_STOP_FLAG in pdf_parse_union_core_v2.py
- Remove unused debug_mode parameter from para_split function in para_split_v3.py

5d6cbcb1

21 Nov, 2024 7 commits

fix(pdf_parse): improve line stop flag detection accuracy · ae3b0a1e

myhloli authored Nov 22, 2024

- Add an additional condition to the line stop flag check
- Ensure character is to the right of the span's left boundary
- This change helps reduce false positives in line stop detection

ae3b0a1e

refactor(txt_parse): improve text extraction accuracy with new algorithm · 309be741

myhloli authored Nov 21, 2024

- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal

309be741

feat(ocr): improve text detection and OCR accuracy · b2e37a2d

myhloli authored Nov 21, 2024

- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy

b2e37a2d

fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification · e4810cee
myhloli authored Nov 21, 2024
```
- Improve logic to skip dropped spans in overlap detection
- Enhance efficiency by avoiding unnecessary comparisons
```
e4810cee

fix(ocr_mkcontent): improve hyphen handling at line ends · a07007e5
myhloli authored Nov 21, 2024
```
- Fix the bug where hyphens in the middle of a line were being discarded
```
a07007e5

refactor(ocr_dict_merge): add threshold parameter for line merging · b9f78c9b

myhloli authored Nov 21, 2024

- Add threshold parameter to merge_spans_to_line function
- Make threshold configurable for y-axis overlap check
- Improve flexibility and accuracy of line merging algorithm

b9f78c9b
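
To illustrate what a configurable y-axis overlap threshold does, here is a sketch of line merging. Only the name `merge_spans_to_line` and the existence of a threshold parameter come from b9f78c9b; the body, the `y_overlap_ratio` helper, the `bbox` layout `(x0, y0, x1, y1)` and the default of 0.6 are assumptions.

```python
from typing import Dict, List, Sequence


def y_overlap_ratio(a: Sequence[float], b: Sequence[float]) -> float:
    """Vertical overlap of two bboxes divided by the smaller bbox height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    return max(0.0, overlap) / min_height if min_height > 0 else 0.0


def merge_spans_to_line(spans: List[Dict], threshold: float = 0.6) -> List[List[Dict]]:
    """Group spans into lines.

    A span joins the current line when its y-overlap with the line's last
    span exceeds `threshold`; otherwise it starts a new line.
    """
    lines: List[List[Dict]] = []
    for span in sorted(spans, key=lambda s: s["bbox"][1]):
        if lines and y_overlap_ratio(span["bbox"], lines[-1][-1]["bbox"]) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    return lines
```

A higher threshold splits slightly offset spans into separate lines, while a lower one merges them, which is why exposing it to callers adds flexibility.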

fix(tools): handle empty language string in common.py · 20ed0cd5

myhloli authored Nov 21, 2024

- Check if language string is empty and set it to None
- This prevents potential errors when an empty language string is passed

20ed0cd5
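
The normalization in 20ed0cd5 amounts to something like the following sketch. `normalize_lang` is a hypothetical helper name; the commit message only states that an empty language string is set to None.

```python
from typing import Optional


def normalize_lang(lang: Optional[str]) -> Optional[str]:
    """Map an empty language string to None; leave other values unchanged."""
    return None if lang == "" else lang
```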

20 Nov, 2024 1 commit
- fix: remove test code · 22008b82
  icecraft authored Nov 20, 2024
  
  22008b82
19 Nov, 2024 2 commits
- refactor: move some constants or enums defs to config folder · b492c19c
  icecraft authored Nov 19, 2024
  
  b492c19c
- delete unused pipeline file (#1024) · bc992433
  Alex Liu authored Nov 19, 2024
  
  bc992433
18 Nov, 2024 3 commits

refactor(para): adjust right margin threshold based on block width · 69805f4b

myhloli authored Nov 19, 2024

- Introduce a variable threshold for right margin based on block width
- Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5)
- Use 0.36 * block_weight for narrower blocks
- This change aims to improve paragraph splitting accuracy for different block widths

69805f4b
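
The rule in 69805f4b can be written out as a small function. The factors 0.26 and 0.36, the 0.5 cutoff and the names `block_weight` and `block_weight_radio` come from the commit message; the function name `right_margin_threshold` is hypothetical, and reading `block_weight_radio` as block width divided by page width is an assumption based on 517fbe5b.

```python
def right_margin_threshold(block_weight: float, block_weight_radio: float) -> float:
    """Return the right-margin threshold for a block.

    block_weight: width of the block (name taken from the commit message).
    block_weight_radio: assumed to be block width / page width.
    """
    # Wider blocks (at least half the page) get the tighter 0.26 factor.
    factor = 0.26 if block_weight_radio >= 0.5 else 0.36
    return factor * block_weight
```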

refactor(para): improve paragraph splitting logic · 517fbe5b

myhloli authored Nov 18, 2024

- Add page size information to blocks
- Calculate block width ratio relative to page width
- Adjust threshold for determining right side indentation
- Implement additional checks for merging blocks across pages
- Improve logic for identifying list structures

517fbe5b

feat(ocr): improve handling of angled text boxes · 4fd966eb

myhloli authored Nov 18, 2024

- Add calculate_is_angle function to detect angled text boxes
- Update update_det_boxes and merge_det_boxes functions to handle angled text boxes
- Modify angle detection logic in various parts of the code

4fd966eb
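
One way an angled-box check could work is sketched below. Only the function name `calculate_is_angle` comes from 4fd966eb; the criterion, the point order (top-left, top-right, bottom-right, bottom-left) and the `tolerance` parameter are assumptions, not the repository's implementation.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def calculate_is_angle(poly: Sequence[Point], tolerance: float = 0.2) -> bool:
    """Return True when a four-point text box looks rotated.

    Assumed point order: top-left, top-right, bottom-right, bottom-left.
    For an axis-aligned box, the vertical distance from the top-left to the
    bottom-right corner equals the box height; a large deviation suggests tilt.
    """
    p1, p2, p3, p4 = poly
    # Mean of the left-edge and right-edge heights.
    height = ((p4[1] - p1[1]) + (p3[1] - p2[1])) / 2
    diagonal_dy = p3[1] - p1[1]
    within = (1 - tolerance) * height <= diagonal_dy <= (1 + tolerance) * height
    return not within
```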