Commits · fcfaede87b7a26900be1646620e58fa237d77c5c · wangsen / MinerU

25 Nov, 2024 7 commits
- Update bug_report.yml · fcfaede8
  Xiaomeng Zhao authored Nov 25, 2024
  
  fcfaede8
- Update version.py with new version · 9d6be7c9
  myhloli authored Nov 25, 2024
  
  9d6be7c9
- Merge pull request #1076 from opendatalab/release-0.10.1 · 4dcf31b6
  Xiaomeng Zhao authored Nov 25, 2024
```
Release 0.10.1
```
  4dcf31b6
- Merge pull request #1075 from myhloli/dev · 4f13c282
  Xiaomeng Zhao authored Nov 25, 2024
```
test: batch process demo PDFs

- Update test block to iterate through multiple demo PDF files
```
  4f13c282
- test: batch process demo PDFs · e11e6b32
  myhloli authored Nov 25, 2024
```
- Update test block to iterate through multiple demo PDF files
- Use os.path.join to construct file paths for better cross-platform compatibility
- Remove hardcoded file path
```
  e11e6b32
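
The change described in e11e6b32 (iterating over several demo PDFs and building paths with `os.path.join` instead of a hardcoded path) could look roughly like the sketch below. The helper name `list_demo_pdfs` and the idea of scanning a single directory are assumptions for illustration; they are not taken from the repository.

```python
import os


def list_demo_pdfs(demo_dir):
    """Return sorted paths of the PDF files directly inside demo_dir.

    Hypothetical helper: the real test block may select files differently.
    """
    return sorted(
        # os.path.join picks the right separator for the current platform.
        os.path.join(demo_dir, name)
        for name in os.listdir(demo_dir)
        if name.lower().endswith(".pdf")
    )
```

A test block could then loop over `list_demo_pdfs(...)` and run the parser on each path in turn, rather than pointing at one fixed file.
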
- Merge pull request #1074 from myhloli/dev · ea94a35b
  Xiaomeng Zhao authored Nov 25, 2024
```
feat(demo): add visualization bbox parameter and refactor parsing process
```
  ea94a35b
- feat(demo): add visualization bbox parameter and refactor parsing process · 17ef5c0f
  myhloli authored Nov 25, 2024
```
- Add is_draw_visualization_bbox parameter to enable/disable visualization of bounding boxes
- Refactor the parsing process to improve code readability and maintainability
- Update function documentation to reflect new parameter
- Simplify test code by using a more generic variable name
```
  17ef5c0f
24 Nov, 2024 4 commits
- Merge pull request #1071 from icecraft/fix/demo · 29b38d12
  Xiaomeng Zhao authored Nov 24, 2024
```
Fix/demo
```
  29b38d12
- fix: remove unused file · e9ace3eb
  icecraft authored Nov 24, 2024
  
  e9ace3eb
- fix: rewrite projects/ and demos with new data api · ae379e6b
  icecraft authored Nov 24, 2024
  
  ae379e6b
- fix: rewrite projects/ and demos with new data api · b1adde8e
  icecraft authored Nov 24, 2024
  
  b1adde8e
22 Nov, 2024 19 commits
- Merge pull request #1066 from opendatalab/master · 4e0b3a8f
  Xiaomeng Zhao authored Nov 22, 2024
```
master -> dev
```
  4e0b3a8f
- Update FAQ_en_us.md · dc37af0a
  Xiaomeng Zhao authored Nov 22, 2024
  
  dc37af0a
- Update FAQ_zh_cn.md · 6eabc682
  Xiaomeng Zhao authored Nov 22, 2024
  
  6eabc682
- Update version.py with new version · 0624b565
  myhloli authored Nov 22, 2024
  
  0624b565
- Merge pull request #1063 from opendatalab/release-0.10.0 · 158e556b
  Xiaomeng Zhao authored Nov 22, 2024
```
Release 0.10.0
```
  158e556b
- Merge pull request #1065 from opendatalab/dev · 30be5017
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(pdf_parse): improve OCR result handling
```
  30be5017
- Merge pull request #1064 from myhloli/dev · b936cb0c
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(pdf_parse): improve OCR result handling
```
  b936cb0c
- fix(pdf_parse): improve OCR result handling · 6b296ee2
  myhloli authored Nov 22, 2024
```
- Add null check for OCR results to prevent errors on empty lists
- Enhance robustness of OCR text processing in the magic-pdf project
```
  6b296ee2
- Merge pull request #1062 from opendatalab/dev · 809bf479
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(table): add null check for OCR result in rapid table prediction 
```
  809bf479
- Merge pull request #1061 from myhloli/dev · 241d4895
  Xiaomeng Zhao authored Nov 22, 2024
```
refactor(model): move page total time logging to custom model analysis
```
  241d4895
- refactor(model): move page total time logging to custom model analysis · f1e2f084
  myhloli authored Nov 22, 2024
```
- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time
```
  f1e2f084
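
A minimal sketch of per-page timing as described in f1e2f084, assuming a loop over pages with a `page_start` timestamp taken before each page is analyzed. The function `analyze_pages`, its `analyze_page` callback, and the log message format are hypothetical. The standard `logging` module is used here only to keep the example self-contained, and the project's own logger may differ.

```python
import logging
import time

logger = logging.getLogger(__name__)


def analyze_pages(pages, analyze_page):
    """Analyze each page and log how long each one took.

    Hypothetical sketch: `analyze_page` stands in for the model call.
    """
    results = []
    for index, page in enumerate(pages):
        page_start = time.time()  # start of this page's processing
        results.append(analyze_page(page))
        elapsed = time.time() - page_start
        logger.info("page %d total time: %.2f s", index, elapsed)
    return results
```

Keeping the timing in the caller, rather than inside the model class, means the logged figure covers everything done for the page, not just one model step.
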
- Merge pull request #1060 from myhloli/dev · 0d632833
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(table): add null check for OCR result in rapid table prediction
```
  0d632833
- fix(table): add null check for OCR result in rapid table prediction · 18aa1a20
  myhloli authored Nov 22, 2024
```
- Add a null check for OCR result in the predict method
- Return None values if OCR result is None to prevent further processing
```
  18aa1a20
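
The null check described in 18aa1a20 can be illustrated with the sketch below. The signature of `predict`, the `ocr_engine` and `table_model` callables, and the choice of returning exactly two `None` values are assumptions made for the example; the actual method in the repository may take different arguments and return a different number of values.

```python
def predict(image, ocr_engine, table_model):
    """Run OCR, then table recognition, stopping early if OCR finds nothing.

    Hypothetical sketch of the guard, not the project's implementation.
    """
    ocr_result = ocr_engine(image)
    if ocr_result is None:
        # No text was recognized: return None placeholders instead of
        # handing None to the table model, which would fail on it.
        return None, None
    return table_model(image, ocr_result)
```

Callers then need to handle the `None` case explicitly, for example by skipping table output for that region.
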
- Merge pull request #1059 from myhloli/dev · 958168b3
  Xiaomeng Zhao authored Nov 22, 2024
```
feat(README): update for v0.10.0 
```
  958168b3
- Merge remote-tracking branch 'origin/dev' into dev · c6627b68
  myhloli authored Nov 22, 2024
  
  c6627b68
- feat(README): update for v0.10.0 · d9cfdad1
  myhloli authored Nov 22, 2024
```
- Introduced hybrid OCR text extraction capabilities in v0.10.0
- Significantly improved parsing performance in complex text distribution scenarios
- Combined advantages of accurate content extraction and faster speed in text mode with more precise span/line region recognition in OCR mode
- Updated both English and Chinese README files
```
  d9cfdad1
- Merge pull request #1058 from myhloli/dev · f70246d6
  Xiaomeng Zhao authored Nov 22, 2024
```
refactor(para): improve line stop flag and remove unused debug mode
```
  f70246d6
- refactor(para): improve line stop flag and remove unused debug mode · 5d6cbcb1
  myhloli authored Nov 22, 2024
```
- Add '-' and '–' to LINE_STOP_FLAG in pdf_parse_union_core_v2.py
- Remove unused debug_mode parameter from para_split function in para_split_v3.py
```
  5d6cbcb1
- Add test cases to json compressor util (#1056) · 93208f44
  Alex Liu authored Nov 22, 2024
```
* delete unused pipeline file

* add json test circle

* add size reduction test case

* add serializable test case

* add invalid json compress test case

* add empty test case

* add special char test case
```
  93208f44
21 Nov, 2024 10 commits
- Merge pull request #1054 from myhloli/dev · 5578d77c
  Xiaomeng Zhao authored Nov 22, 2024
```
test: comment out assertions for metascan classify and meta scan tests
```
  5578d77c
- test: comment out assertions for metascan classify and meta scan tests · e7f883f1
  myhloli authored Nov 22, 2024
```
- Commented out assertions in test_metascan_classify/test_classify.py
- Commented out assertions in test_metascan_classify/test_meta_scan.py
- This change affects multiple test cases across both test files
```
  e7f883f1
- Merge pull request #1053 from myhloli/dev · a9281f18
  Xiaomeng Zhao authored Nov 22, 2024
```
fix(pdf_parse): improve line stop flag detection accuracy
```
  a9281f18
- fix(pdf_parse): improve line stop flag detection accuracy · ae3b0a1e
  myhloli authored Nov 22, 2024
```
- Add an additional condition to the line stop flag check
- Ensure character is to the right of the span's left boundary
- This change helps reduce false positives in line stop detection
```
  ae3b0a1e
- Merge pull request #1052 from icecraft/fix/gradio_project_read · 9e4d6a45
  Xiaomeng Zhao authored Nov 21, 2024
```
fix: use concrete class instead of abstract class
```
  9e4d6a45
- fix: use concrete class instead of abstract class · fa3c453c
  icecraft authored Nov 21, 2024
  
  fa3c453c
- Merge pull request #1050 from myhloli/dev · ead2e670
  Xiaomeng Zhao authored Nov 21, 2024
```
refactor(txt_parse): improve text extraction accuracy with new algorithm
```
  ead2e670
- refactor(txt_parse): improve text extraction accuracy with new algorithm · 309be741
  myhloli authored Nov 21, 2024
```
- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal
```
  309be741
- Merge pull request #1049 from myhloli/dev · 190e2231
  Xiaomeng Zhao authored Nov 21, 2024
```
feat(ocr): improve text detection and OCR accuracy 
```
  190e2231
- Merge remote-tracking branch 'origin/dev' into dev · e52bd023
  myhloli authored Nov 21, 2024
```
# Conflicts:
#	magic_pdf/model/pdf_extract_kit.py
```
  e52bd023