Commits · 6ae50fead8ede57c8a5644a42c45e72f9c5f2377 · wangsen / MinerU

27 Nov, 2024 6 commits

docs(README): remove code examples and redirect to documentation · 6ae50fea

myhloli authored Nov 27, 2024

- Remove command line and API code examples from README files
- Add links to online documentation for command line and API usage
- Update content to point users to the new locations for detailed information

6ae50fea

refactor(ocr): remove unused functions and optimize OCR processing loop · 5f4410b4

myhloli authored Nov 27, 2024

- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops

5f4410b4

refactor(pre_proc): clean up OCR processing code · a46b12e9

myhloli authored Nov 27, 2024

- Remove commented-out code in ocr_dict_merge.py
- Improve imports and code organization in ocr_detect_all_bboxes.py
- Delete unnecessary empty lines and improve code readability

a46b12e9

refactor(libs): remove unused imports and functions · 2db3c263

myhloli authored Nov 27, 2024

- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability

2db3c263

test: json minify · e937e011

myhloli authored Nov 27, 2024

e937e011

Merge pull request #1104 from icecraft/fix/test_tools_ut · 65a9eedd

Xiaomeng Zhao authored Nov 27, 2024

fix: test_tools unittest

65a9eedd

26 Nov, 2024 23 commits
- Merge pull request #1106 from myhloli/dev · b53409ea
  Xiaomeng Zhao authored Nov 26, 2024
```
perf(image_processing): reduce maximum image size for analysis
```
  b53409ea
- perf(image_processing): reduce maximum image size for analysis · b3644157
  myhloli authored Nov 26, 2024
```
- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process
```
  b3644157
- Merge pull request #1105 from icecraft/fix/test_rag · eb6d5dc8
  Xiaomeng Zhao authored Nov 26, 2024
```
fix: test_rag
```
  eb6d5dc8
- fix: test_rag · 843d1382
  icecraft authored Nov 26, 2024
  
  843d1382
- fix: test_tools unittest · 5402e270
  icecraft authored Nov 26, 2024
  
  5402e270
- Merge pull request #1102 from myhloli/dev · 4bc29ba3
  Xiaomeng Zhao authored Nov 26, 2024
```
refactor: remove deprecated markdown_utils function
```
  4bc29ba3
- refactor: remove deprecated markdown_utils function · ce202d92
  myhloli authored Nov 26, 2024
  
  ce202d92
- Merge pull request #1101 from myhloli/dev · c884e2ed
  Xiaomeng Zhao authored Nov 26, 2024
```
test: Shield some failed test cases
```
  c884e2ed
- test: Shield some failed test cases · 3064ef83
  myhloli authored Nov 26, 2024
  
  3064ef83
- Merge pull request #1100 from myhloli/dev · d8823885
  Xiaomeng Zhao authored Nov 26, 2024
```
refactor(pre_proc): remove unused functions and simplify code
```
  d8823885
- refactor(pre_proc): remove unused functions and simplify code · 21fa7819
  myhloli authored Nov 26, 2024
```
- Remove unused imports and functions across multiple files
- Simplify code by deleting unnecessary comments and empty lines
- Update function signatures to match actual usage
- Replace redundant code with more efficient alternatives
```
  21fa7819
- Merge pull request #1099 from myhloli/dev · e6da37dd
  Xiaomeng Zhao authored Nov 26, 2024
```
refactor(magic_pdf): remove unused functions and simplify code
```
  e6da37dd
- refactor(magic_pdf): remove unused functions and simplify code · 6a22b5ab
  myhloli authored Nov 26, 2024
  
  6a22b5ab
- Merge pull request #1098 from myhloli/dev · 79b58a1e
  Xiaomeng Zhao authored Nov 26, 2024
```
refactor(magic_pdf): remove unused functions and simplify code
```
  79b58a1e
- refactor(magic_pdf): remove unused functions and simplify code · ecdaa49a
  myhloli authored Nov 26, 2024
  
  ecdaa49a
- Merge pull request #1095 from myhloli/dev · 1ab691fc
  Xiaomeng Zhao authored Nov 26, 2024
```
feat(pdf_parse): improve text extraction for vertical spans
```
  1ab691fc
- feat(pdf_parse): improve text extraction for vertical spans · 81635062
  myhloli authored Nov 26, 2024
```
- Calculate median span height to identify vertical spans
- Use PyMuPDF's 'dict' output to fill vertical spans with lines
```
  81635062
- Merge pull request #1094 from myhloli/dev · 026c23eb
  Xiaomeng Zhao authored Nov 26, 2024
```
test: comment out assertion in test_metascan_classify
```
  026c23eb
- test: comment out assertion in test_metascan_classify · 32c0fe73
  myhloli authored Nov 26, 2024
```
- Disable the assertion for bool_classify_by_text_layout to skip this test
```
  32c0fe73
- Merge pull request #1089 from myhloli/dev · 14f4bbb9
  Xiaomeng Zhao authored Nov 26, 2024
```
feat(pdf_parse): add OCR score to span data
```
  14f4bbb9
- feat(pdf_parse): add OCR score to span data · 7d4dfca2
  myhloli authored Nov 26, 2024
```
- Add OCR score to span dictionary when OCR text is applied
- Improve data integrity by including confidence score
```
  7d4dfca2
- Merge pull request #1088 from myhloli/dev · 9675a574
  Xiaomeng Zhao authored Nov 26, 2024
```
feat(ocr): filter out low confidence ocr results
```
  9675a574
- feat(ocr): filter out low confidence ocr results · eb45a0e8
  myhloli authored Nov 26, 2024
```
- Add confidence score threshold to filter out low confidence OCR results
- Improve OCR accuracy by ignoring less certain detections
```
  eb45a0e8
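
The two OCR-related commits above (eb45a0e8 and 7d4dfca2) describe dropping low-confidence OCR results and recording the OCR score on each span. A minimal sketch of that idea follows. The function names, the `MIN_OCR_SCORE` value, and the span keys `content` and `score` are assumptions chosen for illustration; they are not taken from the MinerU source, and the commit messages do not state the real threshold.

```python
from typing import Optional

# Hypothetical threshold; the commit message does not give the actual value.
MIN_OCR_SCORE = 0.5


def apply_ocr_result(span: dict, text: str, score: float,
                     min_score: float = MIN_OCR_SCORE) -> Optional[dict]:
    """Return a copy of the span filled with OCR text and its score,
    or None when the OCR confidence is below the threshold."""
    if score < min_score:
        return None
    filled = dict(span)        # copy so the input span is left untouched
    filled["content"] = text   # assumed key for the recognized text
    filled["score"] = score    # keep the confidence next to the text
    return filled


def fill_spans(spans: list, ocr_results: list,
               min_score: float = MIN_OCR_SCORE) -> list:
    """Pair each span with its (text, score) result and keep only confident ones."""
    kept = []
    for span, (text, score) in zip(spans, ocr_results):
        filled = apply_ocr_result(span, text, score, min_score)
        if filled is not None:
            kept.append(filled)
    return kept
```

Keeping the score on the span, rather than discarding it after filtering, lets later stages apply a stricter cutoff without running OCR again.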
25 Nov, 2024 11 commits
- Merge pull request #1086 from myhloli/dev · 61e88cb2
  Xiaomeng Zhao authored Nov 25, 2024
```
refactor(txt_spans_extract_v2): optimize span processing and OCR logic
```
  61e88cb2
- refactor(para): improve block merging logic in para_split_v3.py · 160624bd
  myhloli authored Nov 25, 2024
```
- Add checks for uppercase character start in the first span of a block
```
  160624bd
- refactor(pdf_parse): improve text content extraction from PDF spans · 14656085
  myhloli authored Nov 25, 2024
```
- Optimize character sorting for accurate text assembly
- Handle empty char scenarios to prevent errors
- Remove unnecessary comments and improve code readability
- Enhance OCR text content handling by removing low-confidence spans
```
  14656085
- refactor(pdf_parse): improve code readability and maintainability · 7964ae45
  myhloli authored Nov 25, 2024
  
  7964ae45
- refactor(pdf_parse): improve code readability and maintainability · 97bcc8b2
  myhloli authored Nov 25, 2024
  
  97bcc8b2
- refactor(txt_spans_extract_v2): optimize span processing and OCR logic · 034c59a8
  myhloli authored Nov 25, 2024
```
- Merge useful_spans and unuseful_spans handling
- Simplify overlap ratio calculation and block type checking
- Remove unnecessary span removal and re-addition
```
  034c59a8
- Merge pull request #1082 from myhloli/dev · 6c4040ac
  Xiaomeng Zhao authored Nov 25, 2024
```
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
```
  6c4040ac
- fix(pdf_parse): Move the logic for filling text content into spans before the... · 0d3ef89f
  myhloli authored Nov 25, 2024
```
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
```
  0d3ef89f
- Merge pull request #1077 from opendatalab/master · aa78df41
  Xiaomeng Zhao authored Nov 25, 2024
```
master -> dev
```
  aa78df41
- Update version.py with new version · 9d6be7c9
  myhloli authored Nov 25, 2024
  
  9d6be7c9
- Merge pull request #1076 from opendatalab/release-0.10.1 · 4dcf31b6
  Xiaomeng Zhao authored Nov 25, 2024
```
Release 0.10.1
```
  4dcf31b6
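
Commit 160624bd in the 25 Nov list mentions a check on whether the first span of a block starts with an uppercase character, as part of the block merging logic. The sketch below shows one way such a check could look. The block layout (`lines` → `spans` → `content`), the function names, and the rule "an uppercase start means do not merge into the previous block" are all assumptions for illustration, not the actual code in para_split_v3.py.

```python
def first_span_text(block: dict) -> str:
    """Return the text of the first span in the block's first line, or ''."""
    lines = block.get("lines") or []
    if not lines:
        return ""
    spans = lines[0].get("spans") or []
    if not spans:
        return ""
    return spans[0].get("content", "")


def starts_with_uppercase(block: dict) -> bool:
    """True when the first span's text begins with an uppercase letter."""
    text = first_span_text(block).lstrip()
    return bool(text) and text[0].isupper()


def should_merge_with_previous(block: dict) -> bool:
    """Assumed heuristic: a block that opens with an uppercase letter likely
    begins a new sentence, so it is not merged into the previous block."""
    return not starts_with_uppercase(block)
```

For example, a block whose first span reads "continues the sentence." would be a merge candidate under this heuristic, while one reading "A new paragraph." would not.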