Commits · 309be741e83b4cef5f1e05c834e83a9dd252e1ea · wangsen / MinerU

21 Nov, 2024 2 commits

refactor(txt_parse): improve text extraction accuracy with new algorithm · 309be741

myhloli authored Nov 21, 2024

- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal

309be741

feat(ocr): improve text detection and OCR accuracy · b2e37a2d

myhloli authored Nov 21, 2024

- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy

b2e37a2d

19 Nov, 2024 1 commit
- refactor: move some constants or enums defs to config folder · b492c19c
  icecraft authored Nov 19, 2024
  
  b492c19c
18 Nov, 2024 2 commits

feat(ocr): improve handling of angled text boxes · 4fd966eb

myhloli authored Nov 18, 2024

- Add calculate_is_angle function to detect angled text boxes
- Update update_det_boxes and merge_det_boxes functions to handle angled text boxes
- Modify angle detection logic in various parts of the code

4fd966eb
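
The commit above names a `calculate_is_angle` function but does not describe its logic. As a rough illustration only, one common way to decide whether a detection box is angled is to measure the slope of its top edge. The function name `is_angled`, the point ordering (top-left, top-right, bottom-right, bottom-left) and the 5-degree tolerance below are all assumptions, not MinerU's implementation.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def is_angled(poly: Sequence[Point], tolerance_deg: float = 5.0) -> bool:
    """Return True if the top edge of a 4-point box is tilted beyond the tolerance.

    Assumes poly is ordered top-left, top-right, bottom-right, bottom-left
    (hypothetical convention for this sketch).
    """
    (x0, y0), (x1, y1) = poly[0], poly[1]
    # Angle of the top edge relative to the horizontal axis, in degrees.
    angle = math.degrees(math.atan2(y1 - y0, x1 - x0))
    return abs(angle) > tolerance_deg


flat_box = [(0, 0), (100, 0), (100, 20), (0, 20)]
tilted_box = [(0, 0), (100, 100), (80, 120), (-20, 20)]
print(is_angled(flat_box), is_angled(tilted_box))
```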

fix: replace old rw api with new data api · 6a481320
icecraft authored Nov 18, 2024

6a481320

15 Nov, 2024 1 commit
- refactor(model): rename and restructure model modules · 08f46125
  myhloli authored Nov 15, 2024
  
  08f46125
08 Nov, 2024 2 commits

feat(table): add RapidOCR support for RapidTable model · fe2c2c0d

myhloli authored Nov 09, 2024

- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py

fe2c2c0d

feat(table): integrate RapidTable model for table recognition · 240fe99e

myhloli authored Nov 08, 2024

- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py

240fe99e

07 Nov, 2024 1 commit

feat(model): add xycut algorithm for block sorting · 7d5850e3

myhloli authored Nov 08, 2024

- Implement xycut algorithm to sort blocks when layoutreader fails
- Add recursive_xy_cut function to perform the xycut algorithm
- Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
- Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks

7d5850e3
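
For readers unfamiliar with the technique named in this commit, a recursive XY cut can be sketched as follows: project the boxes onto one axis, split them wherever the projection has an empty gap, then recurse into each group along the other axis. This is a minimal sketch under stated assumptions, not the `recursive_xy_cut` function from the repository; the names `xy_cut_order` and `_split_by_gaps` and the `(x0, y0, x1, y1)` box layout are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def _split_by_gaps(indices: List[int], boxes: Sequence[Box], lo: int, hi: int) -> List[List[int]]:
    """Group box indices into runs separated by empty gaps along one axis."""
    order = sorted(indices, key=lambda i: boxes[i][lo])
    groups: List[List[int]] = []
    current: List[int] = []
    reach = 0.0  # furthest extent covered by the current group
    for i in order:
        start, end = boxes[i][lo], boxes[i][hi]
        if current and start >= reach:  # gap found: close the current group
            groups.append(current)
            current = []
        reach = end if not current else max(reach, end)
        current.append(i)
    if current:
        groups.append(current)
    return groups


def xy_cut_order(boxes: Sequence[Box], indices: Optional[List[int]] = None, cut_y: bool = True) -> List[int]:
    """Return box indices in reading order using alternating Y and X cuts."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    lo, hi = (1, 3) if cut_y else (0, 2)
    groups = _split_by_gaps(indices, boxes, lo, hi)
    if len(groups) == 1:
        # No gap on this axis: try the other axis before giving up.
        cut_y = not cut_y
        lo, hi = (1, 3) if cut_y else (0, 2)
        groups = _split_by_gaps(indices, boxes, lo, hi)
        if len(groups) == 1:
            # Boxes cannot be separated; fall back to top-to-bottom, left-to-right.
            return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
    ordered: List[int] = []
    for group in groups:
        ordered.extend(xy_cut_order(boxes, group, not cut_y))
    return ordered


# A title above two columns whose rows do not line up.
page = [
    (0, 0, 100, 10),    # 0: title
    (0, 20, 45, 38),    # 1: left column, first block
    (55, 20, 100, 45),  # 2: right column, first block
    (0, 40, 45, 60),    # 3: left column, second block
    (55, 47, 100, 60),  # 4: right column, second block
]
print(xy_cut_order(page))
```

In this example the title is read first, then the whole left column, then the right column, which is the behaviour a plain top-to-bottom sort would not give.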

06 Nov, 2024 1 commit

refactor(model): remove unused code and simplify OCR model initialization · 4b0f1176

myhloli authored Nov 06, 2024

- Remove unused code for copying detection and recognition models
- Simplify OCR model initialization using atom_model_manager
- Delete unnecessary comments and empty lines

4b0f1176

05 Nov, 2024 1 commit

fix(table): improve table image processing · 401dfa4e

myhloli authored Nov 05, 2024

- Replace np.array with np.asarray for better performance
- Add image color conversion from RGB to BGR using OpenCV

401dfa4e

04 Nov, 2024 3 commits

feat(model): add HTML minification to StructTableModel · b5117e72

myhloli authored Nov 04, 2024

- Import 're' module for regular expression operations
- Implement HTML minification for 'output_format=html'
- Add 'minify_html' method to remove unnecessary whitespace and format HTML

b5117e72
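
The commit message only says that a `minify_html` method removes unnecessary whitespace using the `re` module. A minimal sketch of that idea, assuming the goal is to collapse whitespace runs and drop whitespace between adjacent tags, might look like the following. The function name `minify_html_sketch` and the exact patterns are assumptions and may differ from what StructTableModel does.

```python
import re


def minify_html_sketch(html: str) -> str:
    """Collapse whitespace in an HTML string (illustrative, not the repository's code)."""
    # Collapse every run of whitespace (including newlines) into a single space.
    html = re.sub(r"\s+", " ", html)
    # Remove the remaining single space between a closing '>' and the next '<'.
    html = re.sub(r">\s+<", "><", html)
    return html.strip()


raw = "<table>\n  <tr>\n    <td> a  b </td>\n  </tr>\n</table>\n"
print(minify_html_sketch(raw))
```

Whitespace inside a cell is kept as a single space, so cell text stays readable while the markup between tags is compacted.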

refactor(model): comment out unused code in ppTableModel · 5ee02a99

myhloli authored Nov 04, 2024

- Comment out an unused code block in the ppTableModel.py file
- Improve code readability and maintainability by removing unnecessary code

5ee02a99

feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit · 11f23843

myhloli authored Nov 04, 2024

- Update StructTableModel to use the latest struct-eqtable library
- Add support for HTML table extraction in PDF Extract Kit
- Improve error handling and model initialization
- Update dependencies in setup.py for struct-eqtable

11f23843

28 Oct, 2024 5 commits
- refactor(table): disable StructEqTable support and add TableMaster support · 377b09cf
  myhloli authored Oct 28, 2024
```
- Remove import and usage of StructTableModel
- Add support for TableMaster model
- Update table model initialization logic to support TableMaster
- Log error and exit if StructEqTable is selected, as it's under upgrade
- Update README files to reflect changes in table parsing capabilities
```
  377b09cf
- fix: add priority match rule · 34a13a89
  icecraft authored Oct 28, 2024
  
  34a13a89
- perf: table model update with PP OCRv4 · 4949408c
  liukaiwen authored Oct 28, 2024
  
  4949408c
- feat: table model update with paddle recognition v4 · a0eff3be
  liukaiwen authored Oct 28, 2024
  
  a0eff3be
- fix: pattern match algorithm · f09148b9
  icecraft authored Oct 28, 2024
  
  f09148b9
25 Oct, 2024 4 commits
- fix: incorrect pair match · 969101dd
  icecraft authored Oct 25, 2024
  
  969101dd
- refactor(ocr): adjust OCR processing parameters · 1807126e
  myhloli authored Oct 25, 2024
```
- Lower the Y-axis overlap threshold for merging spans into lines from 0.6 to 0.5
- Reduce the unclip ratio for OCR detection from 2.4 to 1.8
```
  1807126e
- feat: update return result · 2c60172b
  icecraft authored Oct 25, 2024
  
  2c60172b
- feat: update table match caption algorithm · 92579040
  icecraft authored Oct 25, 2024
  
  92579040
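
Commit 1807126e in this group lowers the Y-axis overlap threshold used when merging spans into lines from 0.6 to 0.5. To make the effect concrete, here is a small sketch that assumes the overlap is measured as the shared vertical extent divided by the height of the shorter span. That definition, the `(x0, y0, x1, y1)` box layout and the names `y_overlap_ratio` and `on_same_line` are assumptions for illustration, not the repository's code.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def y_overlap_ratio(a: Box, b: Box) -> float:
    """Shared vertical extent divided by the height of the shorter box."""
    overlap = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    min_height = min(a[3] - a[1], b[3] - b[1])
    return overlap / min_height if min_height > 0 else 0.0


def on_same_line(a: Box, b: Box, threshold: float = 0.5) -> bool:
    """Treat two spans as one line when their Y overlap exceeds the threshold."""
    return y_overlap_ratio(a, b) > threshold


span_a = (0.0, 0.0, 10.0, 10.0)
span_b = (12.0, 4.5, 20.0, 14.5)  # shifted down by 4.5 units
print(on_same_line(span_a, span_b, threshold=0.6))  # old threshold
print(on_same_line(span_a, span_b, threshold=0.5))  # new threshold
```

Under these assumptions the two spans overlap by 55% of their height, so they are merged with the new threshold but would have been kept apart with the old one.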
24 Oct, 2024 3 commits
- refactor(magic_pdf): adjust confidence threshold for DocLayout_YOLO model · ce72cf05
  myhloli authored Oct 24, 2024
```
- Changed the confidence threshold from 0.15 to 0.25 in the DocLayout_YOLO model prediction
- This adjustment aims to improve the accuracy of layout detection by filtering out low-confidence predictions
```
  ce72cf05
- style: remove unused log info · c200effc
  icecraft authored Oct 24, 2024
  
  c200effc
- feat: add [figure | table] match [caption | footnote] match algorithm v2 · 283b597a
  icecraft authored Oct 19, 2024
```
feat: add Data api
```
  283b597a
23 Oct, 2024 1 commit

feat(model): add support for DocLayout-YOLO model · 1279f2cd

myhloli authored Oct 23, 2024

- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches

1279f2cd

17 Oct, 2024 2 commits

feat: merge formula update · 51f56aa3
liukaiwen authored Oct 17, 2024

51f56aa3

refactor(ocr): increase the dilation factor in OCR to address the issue of word concatenation · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

14 Oct, 2024 1 commit

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 15, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

10 Oct, 2024 1 commit

feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e

myhloli authored Oct 10, 2024

- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages

6f63e70e
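
The page range support and out-of-range handling mentioned in this commit can be pictured as clamping a requested range to the pages that actually exist. The sketch below is an assumption about how such clamping could work; `resolve_page_range` and its parameter names are hypothetical and are not taken from `doc_analyze_by_custom_model`.

```python
from typing import Optional


def resolve_page_range(page_count: int, start_page: int = 0, end_page: Optional[int] = None) -> range:
    """Clamp a requested zero-based, inclusive page range to the document's pages."""
    last_page = page_count - 1
    # A missing, negative, or too-large end falls back to the last page.
    if end_page is None or end_page < 0 or end_page > last_page:
        end_page = last_page
    start_page = max(0, start_page)
    # An empty range results when the start lies beyond the end.
    return range(start_page, end_page + 1)


print(list(resolve_page_range(5, 1, 3)))
print(list(resolve_page_range(5, 2, 99)))
```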

08 Oct, 2024 4 commits
- feat: merge formula update · a3358878
  liukaiwen authored Oct 08, 2024
  
  a3358878
- fix: caption|footnote match algorithm · f31433b8
  icecraft authored Oct 08, 2024
  
  f31433b8
- fix: caption or footnote match algorithm · ef45ad08
  icecraft authored Oct 08, 2024
  
  ef45ad08
- perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4
  myhloli authored Oct 08, 2024
```
- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization
```
  fb9949c4
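
The conditional cleanup described in commit fb9949c4 can be outlined as "only pay for garbage collection when GPU memory is scarce, and record how long it took". The sketch below takes the GPU memory size as a plain argument so that it runs without a GPU library. The name `maybe_clean_memory`, the 8 GB default (borrowed from the later commit fe2c2c0d, which mentions devices with <= 8GB VRAM) and the return convention are assumptions, not the repository's code.

```python
import gc
import time
from typing import Optional


def maybe_clean_memory(total_vram_gb: float, threshold_gb: float = 8.0) -> Optional[float]:
    """Run garbage collection only on low-memory devices.

    Returns the elapsed time in seconds (two decimals), or None when cleanup is skipped.
    """
    if total_vram_gb > threshold_gb:
        return None  # enough memory: skip the cleanup cost
    start = time.perf_counter()
    gc.collect()  # a real pipeline would likely also release cached GPU memory here
    return round(time.perf_counter() - start, 2)


elapsed = maybe_clean_memory(total_vram_gb=6.0)
print(f"gc time: {elapsed}s" if elapsed is not None else "cleanup skipped")
```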
06 Oct, 2024 1 commit

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7

30 Sep, 2024 1 commit
- chore: remove useless files · fcf24242
  myhloli authored Sep 30, 2024
  
  fcf24242
29 Sep, 2024 1 commit

refactor(memory management): remove unused clean_memory function · 4c9bf8ab

myhloli authored Sep 29, 2024

The clean_memory function has been removed from pdf_parse_union_core_v2.py because it was unused.
This change streamlines the code and prevents potential confusion about its purpose.

4c9bf8ab

28 Sep, 2024 1 commit

refactor(pdf_parse_union_core_v2): update import paths to use new package structure · 5522d0a3

myhloli authored Sep 28, 2024

Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3`
module. This ensures compatibility with the revised directory layout.

5522d0a3

20 Sep, 2024 1 commit
- fix(pdf_extract_kit): change unimernet base -> small · f2a3a495
  myhloli authored Sep 20, 2024
  
  f2a3a495