Commits · b2e37a2d1b3fa6bb9d2c368e57a1fe54dca5ebeb · wangsen / MinerU

21 Nov, 2024 5 commits

feat(ocr): improve text detection and OCR accuracy · b2e37a2d

myhloli authored Nov 21, 2024

- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy


fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification · e4810cee

myhloli authored Nov 21, 2024

- Improve logic to skip dropped spans in overlap detection
- Enhance efficiency by avoiding unnecessary comparisons

fix(ocr_mkcontent): improve hyphen handling at line ends · a07007e5

myhloli authored Nov 21, 2024

- Fix the bug where hyphens in the middle of a line were being discarded

refactor(ocr_dict_merge): add threshold parameter for line merging · b9f78c9b

myhloli authored Nov 21, 2024

- Add threshold parameter to merge_spans_to_line function
- Make threshold configurable for y-axis overlap check
- Improve flexibility and accuracy of line merging algorithm

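The configurable y-axis overlap check described in this commit can be sketched as follows. This is a minimal illustration, not the repository's `merge_spans_to_line`: the helper names, the `(x0, y0, x1, y1)` bbox layout, and the default threshold of 0.6 are all assumptions.

```python
def y_overlap_ratio(bbox1, bbox2):
    """Vertical overlap divided by the height of the shorter box."""
    _, top1, _, bottom1 = bbox1
    _, top2, _, bottom2 = bbox2
    overlap = max(0.0, min(bottom1, bottom2) - max(top1, top2))
    min_height = min(bottom1 - top1, bottom2 - top2)
    return overlap / min_height if min_height > 0 else 0.0


def merge_spans_to_lines(spans, threshold=0.6):
    """Group spans into lines (hypothetical helper).

    A span joins the current line when its y-overlap ratio with the
    line's most recently added span exceeds `threshold`.
    """
    lines = []
    for span in sorted(spans, key=lambda s: s["bbox"][1]):
        if lines and y_overlap_ratio(span["bbox"], lines[-1][-1]["bbox"]) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Order spans left to right inside each line.
    return [sorted(line, key=lambda s: s["bbox"][0]) for line in lines]
```

Raising the threshold makes merging stricter, so slightly misaligned spans end up on separate lines.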

fix(tools): handle empty language string in common.py · 20ed0cd5

myhloli authored Nov 21, 2024

- Check if language string is empty and set it to None
- This prevents potential errors when an empty language string is passed

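A minimal sketch of the empty-string guard this commit describes. `normalize_lang` is a hypothetical name, and treating whitespace-only input as empty is an added assumption, not something the commit message states.

```python
def normalize_lang(lang):
    """Return None for a missing, empty, or whitespace-only language string."""
    if lang is None or lang.strip() == "":
        return None
    return lang
```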

18 Nov, 2024 3 commits

refactor(para): adjust right margin threshold based on block width · 69805f4b

myhloli authored Nov 19, 2024

- Introduce a variable threshold for right margin based on block width
- Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5)
- Use 0.36 * block_weight for narrower blocks
- This change aims to improve paragraph splitting accuracy for different block widths

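The width-dependent threshold can be written out as a small function. This is a sketch under assumptions: the function name and signature are hypothetical, the identifier spellings `block_weight` and `block_weight_radio` are copied from the commit message, and the ratio is taken relative to page width as the related 517fbe5b commit describes.

```python
def right_margin_threshold(block_weight, page_width):
    """Pick the right-margin threshold from the block's share of page width."""
    block_weight_radio = block_weight / page_width if page_width else 0.0
    # Wider blocks (at least half the page) get the tighter 0.26 factor.
    factor = 0.26 if block_weight_radio >= 0.5 else 0.36
    return factor * block_weight
```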

refactor(para): improve paragraph splitting logic · 517fbe5b

myhloli authored Nov 18, 2024

- Add page size information to blocks
- Calculate block width ratio relative to page width
- Adjust threshold for determining right side indentation
- Implement additional checks for merging blocks across pages
- Improve logic for identifying list structures


feat(ocr): improve handling of angled text boxes · 4fd966eb

myhloli authored Nov 18, 2024

- Add calculate_is_angle function to detect angled text boxes
- Update update_det_boxes and merge_det_boxes functions to handle angled text boxes
- Modify angle detection logic in various parts of the code

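One way to detect an angled text box is to measure the tilt of its top edge. The sketch below is a generic illustration and not the repository's `calculate_is_angle`: the function name, the clockwise-from-top-left point order, and the 5-degree tolerance are assumptions.

```python
import math


def is_angled(poly, max_degrees=5.0):
    """Return True when the top edge of a 4-point box is tilted.

    `poly` lists corners clockwise from top-left; only the first two
    points (the top edge) are used.
    """
    (x0, y0), (x1, y1) = poly[0], poly[1]
    angle = abs(math.degrees(math.atan2(y1 - y0, x1 - x0)))
    return angle > max_degrees
```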

15 Nov, 2024 1 commit

refactor(model): rename and restructure model modules · 08f46125

myhloli authored Nov 15, 2024

14 Nov, 2024 1 commit

fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print · 918ed65b

myhloli authored Nov 14, 2024

13 Nov, 2024 1 commit

fix(ocr_mkcontent): improve handling of single-character content · 2de1d0ef

myhloli authored Nov 13, 2024

- Add digit check for single-character content to avoid adding unnecessary spaces

11 Nov, 2024 1 commit

Update para_split_v3.py · 220a24cd

hyastar authored Nov 11, 2024

08 Nov, 2024 5 commits

feat(table): add RapidOCR support for RapidTable model · fe2c2c0d

myhloli authored Nov 09, 2024

- Integrate RapidOCR with RapidTable model for table recognition
- Improve memory management for devices with <= 8GB VRAM
- Update table recognition process to use RapidOCR for RapidTable
- Add rapidocr-paddle dependency in setup.py


refactor(table): update default table model to Rapid Table · e78edb19

myhloli authored Nov 08, 2024

- Change the default table model from TABLE_MASTER to RAPID_TABLE

feat(table): integrate RapidTable model for table recognition · 240fe99e

myhloli authored Nov 08, 2024

- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py


refactor(pdf_parse): adjust line count threshold for layoutreader · 5936684f

myhloli authored Nov 08, 2024

- Lower the line count threshold from 316 to 200 to ensure compatibility
- This change aims to prevent potential issues with layoutreader's maximum line support


refactor(pdf_parse): adjust line count limit for layoutreader · 5468e56f

myhloli authored Nov 08, 2024

- Decrease the maximum line count from 512 to 316 for layoutreader

07 Nov, 2024 1 commit

feat(model): add xycut algorithm for block sorting · 7d5850e3

myhloli authored Nov 08, 2024

- Implement xycut algorithm to sort blocks when layoutreader fails
- Add recursive_xy_cut function to perform the xycut algorithm
- Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
- Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks

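For readers unfamiliar with XY cut, the idea is to split the page along empty horizontal gaps, then split each piece along empty vertical gaps, and repeat. The sketch below is a generic textbook version and not the repository's `recursive_xy_cut`: the names `xy_cut` and `_split_by_gaps` and the `(x0, y0, x1, y1)` box layout are assumptions.

```python
def _split_by_gaps(indices, boxes, lo, hi):
    """Split box indices into groups separated by empty gaps on one axis.

    `lo` and `hi` are the tuple positions of the interval on that axis.
    """
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups, current, reach = [], [ordered[0]], boxes[ordered[0]][hi]
    for i in ordered[1:]:
        if boxes[i][lo] >= reach:  # gap found: start a new group
            groups.append(current)
            current = [i]
        else:
            current.append(i)
        reach = max(reach, boxes[i][hi])
    groups.append(current)
    return groups


def xy_cut(boxes, indices=None, horizontal=True):
    """Return box indices in reading order via recursive XY cuts.

    Horizontal cuts (splitting on y) alternate with vertical cuts
    (splitting on x).
    """
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    lo, hi = (1, 3) if horizontal else (0, 2)
    groups = _split_by_gaps(indices, boxes, lo, hi)
    if len(groups) == 1:
        # No cut on this axis; try the other axis once.
        other_lo, other_hi = (0, 2) if horizontal else (1, 3)
        other = _split_by_gaps(indices, boxes, other_lo, other_hi)
        if len(other) == 1:
            # Boxes overlap on both axes: fall back to top-left ordering.
            return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
        groups, horizontal = other, not horizontal
    order = []
    for group in groups:
        order.extend(xy_cut(boxes, group, not horizontal))
    return order
```

On a page with a full-width title above two columns whose blocks do not line up vertically, this yields the title, then the left column top to bottom, then the right column.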

06 Nov, 2024 1 commit

refactor(model): remove unused code and simplify OCR model initialization · 4b0f1176

myhloli authored Nov 06, 2024

- Remove unused code for copying detection and recognition models
- Simplify OCR model initialization using atom_model_manager
- Delete unnecessary comments and empty lines


05 Nov, 2024 1 commit

fix(table): improve table image processing · 401dfa4e

myhloli authored Nov 05, 2024

- Replace np.array with np.asarray for better performance
- Add image color conversion from RGB to BGR using OpenCV


04 Nov, 2024 4 commits

fix(merge_text): add ligature replacement functionality · bd755962

myhloli authored Nov 04, 2024

- Implement __replace_ligatures function to split ligature characters
- Integrate ligature replacement into the merge_para_with_text function
- Handle common ligatures such as fi, fl, ff, ffi, and ffl

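Ligature splitting of this kind can be done with a translation table over the Unicode ligature code points. This is a minimal sketch, not the repository's `__replace_ligatures`; the names `LIGATURES` and `replace_ligatures` are assumptions, and only the five ligatures named in the commit are covered.

```python
# Unicode presentation-form ligatures mapped to their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text):
    """Expand single-character ligatures into separate letters."""
    return text.translate(_LIGATURE_TABLE)
```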

feat(model): add HTML minification to StructTableModel · b5117e72

myhloli authored Nov 04, 2024

- Import 're' module for regular expression operations
- Implement HTML minification for 'output_format=html'
- Add 'minify_html' method to remove unnecessary whitespace and format HTML

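A regex-based minifier along the lines this commit describes might look like the sketch below. It is an assumption about the approach, not the `minify_html` method in StructTableModel, and it is only safe for table markup without whitespace-sensitive elements such as `<pre>`.

```python
import re


def minify_html(html):
    """Strip whitespace between tags and collapse remaining runs of whitespace."""
    html = re.sub(r">\s+<", "><", html)  # whitespace between adjacent tags
    html = re.sub(r"\s+", " ", html)     # collapse whitespace inside text
    return html.strip()
```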

refactor(model): comment out unused code in ppTableModel · 5ee02a99

myhloli authored Nov 04, 2024

- Comment out an unused code block in the ppTableModel.py file
- Improve code readability and maintainability by removing unnecessary code


feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit · 11f23843

myhloli authored Nov 04, 2024

- Update StructTableModel to use the latest struct-eqtable library
- Add support for HTML table extraction in PDF Extract Kit
- Improve error handling and model initialization
- Update dependencies in setup.py for struct-eqtable


03 Nov, 2024 2 commits

fix(dict2md): improve text concatenation logic · 99cf160d

myhloli authored Nov 03, 2024

- Optimize content stripping and checking logic
- Add special case handling for single-character content
- Adjust spacing rules for different content types


feat(para_split_v3): improve list identification with block aspect ratio · cf0d76c0

myhloli authored Nov 03, 2024

- Add block_height calculation to determine block aspect ratio
- Update list identification condition to include aspect ratio check
- Improve code readability with better formatting and line breaks


02 Nov, 2024 2 commits

feat(list): improve list detection algorithm · 2bf6c268

myhloli authored Nov 03, 2024

- Add center_close_num and external_sides_not_close_num variables to analyze line positioning
- Implement new list detection condition for centered lines
- Enhance existing list detection logic with additional checks

fix(list): improve list identification accuracy · a8f2e7d6

myhloli authored Nov 03, 2024

- Adjust the threshold for determining right-side spacing to 0.26 * block_weight
- Add TODO comment for special list identification with all centered lines
- Modify the condition for recognizing short item lists with left alignment
- Update the condition for identifying the end of a list item

01 Nov, 2024 8 commits

fix(ocr): handle inline equations consistently with text content · 87b9eeee

myhloli authored Nov 01, 2024

- Include InlineEquation in the condition for handling text content
- Remove separate block for InlineEquation processing
- Ensures consistent handling of inline equations and text, improving content formatting


fix(ocr_mkcontent): improve content handling for different languages and equation types · 7c03014c

myhloli authored Nov 01, 2024

- Adjust content formatting for Chinese, Japanese, Korean, and Western languages
- Implement proper spacing rules around inline equations
- Remove unnecessary empty lines in paragraph text

feat(pdf_parse): improve span filtering and add new block types · 149132d6

myhloli authored Nov 01, 2024

- Refactor remove_outside_spans function to filter spans more accurately
- Add image_footnote, index, and list block types to output file documentation
- Update draw_span_bbox to use preproc_blocks instead of para_blocks
- Bump version to 0.9.0


feat: add more unit tests · 338c6814

icecraft authored Nov 01, 2024

feat: add more docs about data-related API · 47db844c

xu rui authored Nov 01, 2024

fix(pdf_parse): improve span removal logic for all content types · ad0d06b6

myhloli authored Nov 01, 2024

- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy


fix(pdf_parse): improve span removal logic for all content types · 509128d5

myhloli authored Nov 01, 2024

- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy


fix(pdf_parse): improve span removal logic for all content types · eeda90af

myhloli authored Nov 01, 2024

- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy


31 Oct, 2024 1 commit

fix(pdf_parse): optimize span processing by removing outside spans · 6b9f816f

myhloli authored Oct 31, 2024

- Add new function `remove_outside_spans` to filter spans based on image and table blocks
- Reorder span processing steps to improve efficiency
- Update imports to include `calculate_overlap_area_in_bbox1_area_ratio`

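The filtering idea can be sketched with an overlap ratio measured against the span's own area, which is what the imported name `calculate_overlap_area_in_bbox1_area_ratio` suggests. Everything below is an assumption for illustration: the helper names, the `(x0, y0, x1, y1)` bbox layout, the 0.5 cutoff, and the rule of keeping spans that lie mostly inside some block are not taken from the repository.

```python
def overlap_ratio_in_bbox1(bbox1, bbox2):
    """Area of the intersection of bbox1 and bbox2, divided by bbox1's area."""
    width = min(bbox1[2], bbox2[2]) - max(bbox1[0], bbox2[0])
    height = min(bbox1[3], bbox2[3]) - max(bbox1[1], bbox2[1])
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if width <= 0 or height <= 0 or area1 <= 0:
        return 0.0
    return (width * height) / area1


def drop_outside_spans(spans, blocks, min_ratio=0.5):
    """Keep spans that lie mostly inside at least one block (hypothetical rule)."""
    return [
        span
        for span in spans
        if any(
            overlap_ratio_in_bbox1(span["bbox"], block["bbox"]) >= min_ratio
            for block in blocks
        )
    ]
```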

30 Oct, 2024 2 commits

fix(magic_pdf): handle missing image_path in spans · faf8c286

myhloli authored Oct 30, 2024

- Add check for 'image_path' in spans to avoid errors when it's missing
- Update image handling in both paragraph text and content dictionary
- Improve error handling and make the code more robust


fix(ocr): improve image and table content extraction · b7e9d454

myhloli authored Oct 30, 2024

- Update image content extraction to iterate through all spans in a block
- Add support for extracting table content from spans within a block
- Handle multiple content types within table spans (latex, html, image)
- Refactor code to be more modular and easier to maintain


28 Oct, 2024 1 commit

refactor(table): disable StructEqTable support and add TableMaster support · 377b09cf

myhloli authored Oct 28, 2024

- Remove import and usage of StructTableModel
- Add support for TableMaster model
- Update table model initialization logic to support TableMaster
- Log error and exit if StructEqTable is selected, as it's under upgrade
- Update README files to reflect changes in table parsing capabilities
