Commits · 32c0fe733db258ebecf92d9f4da0825296202076 · wangsen / MinerU

26 Nov, 2024 3 commits

test: comment out assertion in test_metascan_classify · 32c0fe73
myhloli authored Nov 26, 2024
```
- Disable the assertion for bool_classify_by_text_layout to skip this test
```

feat(pdf_parse): add OCR score to span data · 7d4dfca2

myhloli authored Nov 26, 2024

- Add OCR score to span dictionary when OCR text is applied
- Improve data integrity by including confidence score


feat(ocr): filter out low confidence ocr results · eb45a0e8

myhloli authored Nov 26, 2024

- Add confidence score threshold to filter out low confidence OCR results
- Improve OCR accuracy by ignoring less certain detections
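
The threshold filter described in this commit could look roughly like the sketch below. It is an illustration only, not the project's code: the `OcrResult` record layout, the function name `filter_low_confidence`, and the `0.5` default are all assumptions, since the commit message does not state the actual data structure or threshold value.

```python
from typing import TypedDict


class OcrResult(TypedDict):
    """Hypothetical OCR result record; the real structure may differ."""
    text: str
    score: float


def filter_low_confidence(
    results: list[OcrResult], min_score: float = 0.5
) -> list[OcrResult]:
    """Keep only results whose confidence is at least min_score.

    The 0.5 default is an illustrative value, not the project's threshold.
    """
    return [r for r in results if r["score"] >= min_score]


# Example input: one confident detection and one uncertain one.
results: list[OcrResult] = [
    {"text": "Invoice", "score": 0.98},
    {"text": "l|I", "score": 0.21},
]
kept = filter_low_confidence(results)
```

Keeping the threshold as a parameter rather than a constant makes it easy to tune per document type.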


25 Nov, 2024 8 commits

refactor(para): improve block merging logic in para_split_v3.py · 160624bd
myhloli authored Nov 25, 2024
```
- Add checks for uppercase character start in the first span of a block
```

refactor(pdf_parse): improve text content extraction from PDF spans · 14656085

myhloli authored Nov 25, 2024

- Optimize character sorting for accurate text assembly
- Handle empty char scenarios to prevent errors
- Remove unnecessary comments and improve code readability
- Enhance OCR text content handling by removing low-confidence spans


refactor(pdf_parse): improve code readability and maintainability · 7964ae45
myhloli authored Nov 25, 2024

refactor(pdf_parse): improve code readability and maintainability · 97bcc8b2
myhloli authored Nov 25, 2024


refactor(txt_spans_extract_v2): optimize span processing and OCR logic · 034c59a8

myhloli authored Nov 25, 2024

- Merge useful_spans and unuseful_spans handling
- Simplify overlap ratio calculation and block type checking
- Remove unnecessary span removal and re-addition


fix(pdf_parse): Move the logic for filling text content into spans before the... · 0d3ef89f

myhloli authored Nov 25, 2024

fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.


test: batch process demo PDFs · e11e6b32
myhloli authored Nov 25, 2024
```
- Update test block to iterate through multiple demo PDF files
- Use os.path.join to construct file paths for better cross-platform compatibility
- Remove hardcoded file path
```
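
As a rough illustration of the path handling this commit describes, the sketch below builds one path per demo file with `os.path.join` instead of a hardcoded string. The helper name `demo_pdf_paths`, the `demo` directory, and the file names are hypothetical and are not taken from the repository.

```python
import os


def demo_pdf_paths(demo_dir: str, names: list[str]) -> list[str]:
    """Build one path per demo file using the platform's path separator."""
    return [os.path.join(demo_dir, f"{name}.pdf") for name in names]


# Hypothetical directory and file names, for illustration only.
paths = demo_pdf_paths("demo", ["demo1", "demo2"])
for path in paths:
    print(path)
```

Because `os.path.join` inserts the separator for the current platform, the same loop works on Windows and POSIX systems.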

feat(demo): add visualization bbox parameter and refactor parsing process · 17ef5c0f

myhloli authored Nov 25, 2024

- Add is_draw_visualization_bbox parameter to enable/disable visualization of bounding boxes
- Refactor the parsing process to improve code readability and maintainability
- Update function documentation to reflect new parameter
- Simplify test code by using a more generic variable name


24 Nov, 2024 4 commits
- Merge pull request #1071 from icecraft/fix/demo · 29b38d12
  Xiaomeng Zhao authored Nov 24, 2024
```
Fix/demo
```
- fix: remove unused file · e9ace3eb
  icecraft authored Nov 24, 2024
  
- fix: rewrite projects/ and demos with new data api · ae379e6b
  icecraft authored Nov 24, 2024
  
- fix: rewrite projects/ and demos with new data api · b1adde8e
  icecraft authored Nov 24, 2024
  
22 Nov, 2024 19 commits
- Merge pull request #1066 from opendatalab/master · 4e0b3a8f
  Xiaomeng Zhao authored Nov 22, 2024
```
master -> dev
```
- Update FAQ_en_us.md · dc37af0a
  Xiaomeng Zhao authored Nov 22, 2024
  
- Update FAQ_zh_cn.md · 6eabc682
  Xiaomeng Zhao authored Nov 22, 2024
  
- Update version.py with new version · 0624b565
  myhloli authored Nov 22, 2024
  
- Merge pull request #1063 from opendatalab/release-0.10.0 · 158e556b
  Xiaomeng Zhao authored Nov 22, 2024
```
Release 0.10.0
```
- Merge pull request #1065 from opendatalab/dev · 30be5017
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(pdf_parse): improve OCR result handling
```
- Merge pull request #1064 from myhloli/dev · b936cb0c
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(pdf_parse): improve OCR result handling
```
- fix(pdf_parse): improve OCR result handling · 6b296ee2
  myhloli authored Nov 22, 2024
```
- Add null check for OCR results to prevent errors on empty lists
- Enhance robustness of OCR text processing in the magic-pdf project
```
- Merge pull request #1062 from opendatalab/dev · 809bf479
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(table): add null check for OCR result in rapid table prediction 
```
- Merge pull request #1061 from myhloli/dev · 241d4895
  Xiaomeng Zhao authored Nov 22, 2024
```
refactor(model): move page total time logging to custom model analysis
```
- refactor(model): move page total time logging to custom model analysis · f1e2f084
  myhloli authored Nov 22, 2024
```
- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time
```
- Merge pull request #1060 from myhloli/dev · 0d632833
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(table): add null check for OCR result in rapid table prediction
```
- fix(table): add null check for OCR result in rapid table prediction · 18aa1a20
  myhloli authored Nov 22, 2024
```
- Add a null check for OCR result in the predict method
- Return None values if OCR result is None to prevent further processing
```
- Merge pull request #1059 from myhloli/dev · 958168b3
  Xiaomeng Zhao authored Nov 22, 2024
```
feat(README): update for v0.10.0 
```
- Merge remote-tracking branch 'origin/dev' into dev · c6627b68
  myhloli authored Nov 22, 2024
  
- feat(README): update for v0.10.0 · d9cfdad1
  myhloli authored Nov 22, 2024
```
- Introduced hybrid OCR text extraction capabilities in v0.10.0
- Significantly improved parsing performance in complex text distribution scenarios
- Combined advantages of accurate content extraction and faster speed in text mode with more precise span/line region recognition in OCR mode
- Updated both English and Chinese README files
```
- Merge pull request #1058 from myhloli/dev · f70246d6
  Xiaomeng Zhao authored Nov 22, 2024
```
refactor(para): improve line stop flag and remove unused debug mode
```
- refactor(para): improve line stop flag and remove unused debug mode · 5d6cbcb1
  myhloli authored Nov 22, 2024
```
- Add '-' and '–' to LINE_STOP_FLAG in pdf_parse_union_core_v2.py
- Remove unused debug_mode parameter from para_split function in para_split_v3.py
```
- Add test cases to json compressor util (#1056) · 93208f44
  Alex Liu authored Nov 22, 2024
```
* delete unused pipeline file

* add json test circle

* add size reduction test case

* add serializable test case

* add invalid json compress test case

* add empty test case

* add special char test case
```
21 Nov, 2024 6 commits
- Merge pull request #1054 from myhloli/dev · 5578d77c
  Xiaomeng Zhao authored Nov 22, 2024
```
test: comment out assertions for metascan classify and meta scan tests
```
- test: comment out assertions for metascan classify and meta scan tests · e7f883f1
  myhloli authored Nov 22, 2024
```
- Commented out assertions in test_metascan_classify/test_classify.py
- Commented out assertions in test_metascan_classify/test_meta_scan.py
- This change affects multiple test cases across both test files
```
- Merge pull request #1053 from myhloli/dev · a9281f18
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(pdf_parse): improve line stop flag detection accuracy
```
- fix(pdf_parse): improve line stop flag detection accuracy · ae3b0a1e
  myhloli authored Nov 22, 2024
```
- Add an additional condition to the line stop flag check
- Ensure character is to the right of the span's left boundary
- This change helps reduce false positives in line stop detection
```
- Merge pull request #1052 from icecraft/fix/gradio_project_read · 9e4d6a45
  Xiaomeng Zhao authored Nov 21, 2024
```
fix: use concrete class instead of abstract class
```
- fix: use concrete class instead of abstract class · fa3c453c
  icecraft authored Nov 21, 2024
  