- 22 Nov, 2024 1 commit
-
-
myhloli authored
- Add '-' and '–' to LINE_STOP_FLAG in pdf_parse_union_core_v2.py
- Remove unused debug_mode parameter from para_split function in para_split_v3.py
-
- 21 Nov, 2024 19 commits
-
-
myhloli authored
- Commented out assertions in test_metascan_classify/test_classify.py
- Commented out assertions in test_metascan_classify/test_meta_scan.py
- This change affects multiple test cases across both test files
-
myhloli authored
- Add an additional condition to the line stop flag check
- Ensure character is to the right of the span's left boundary
- This change helps reduce false positives in line stop detection
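A minimal sketch of the added condition, assuming characters and spans carry `(x0, y0, x1, y1)` bounding boxes. The function name `is_line_stop_char` and the contents of `LINE_STOP_FLAG` shown here are illustrative assumptions, not the project's actual definitions.

```python
# Illustrative flag set only; the real LINE_STOP_FLAG lives in
# pdf_parse_union_core_v2.py and may contain different characters.
LINE_STOP_FLAG = (".", "!", "?", "。", "！", "？")


def is_line_stop_char(char, char_bbox, span_bbox):
    """Return True if `char` should count as a line-stop flag for this span.

    The extra check requires the character's left edge to lie to the right
    of the span's left boundary, so a stop character sitting at or before
    the span's left edge is not treated as ending the line.
    """
    is_stop_flag = char in LINE_STOP_FLAG
    right_of_span_left = char_bbox[0] > span_bbox[0]
    return is_stop_flag and right_of_span_left


span = (10.0, 0.0, 60.0, 10.0)
inside = is_line_stop_char(".", (50.0, 0.0, 55.0, 10.0), span)
outside = is_line_stop_char(".", (5.0, 0.0, 9.0, 10.0), span)
```

Here `inside` is accepted and `outside` is rejected, which is the kind of false positive the commit describes.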
-
Xiaomeng Zhao authored
fix: use concrete class instead of abstract class
-
icecraft authored
-
Xiaomeng Zhao authored
refactor(txt_parse): improve text extraction accuracy with new algorithm
-
myhloli authored
- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal
-
Xiaomeng Zhao authored
feat(ocr): improve text detection and OCR accuracy
-
myhloli authored
# Conflicts:
#   magic_pdf/model/pdf_extract_kit.py
-
myhloli authored
- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy
-
Xiaomeng Zhao authored
fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification
-
Xiaomeng Zhao authored
-
myhloli authored
- Improve logic to skip dropped spans in overlap detection
- Enhance efficiency by avoiding unnecessary comparisons
-
Xiaomeng Zhao authored
fix(ocr_mkcontent): improve hyphen handling at line ends
-
myhloli authored
- Fix the bug where hyphens in the middle of a line were being discarded
-
Xiaomeng Zhao authored
refactor(ocr_dict_merge): add threshold parameter for line merging
-
myhloli authored
- Add threshold parameter to merge_spans_to_line function
- Make threshold configurable for y-axis overlap check
- Improve flexibility and accuracy of line merging algorithm
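The idea of a configurable y-axis overlap threshold can be sketched as follows. This is a simplified illustration, not the project's implementation: the helper `y_overlap_ratio`, the default value `0.6`, and the span layout (a dict with a `bbox` of `[x0, y0, x1, y1]`) are assumptions made for the example.

```python
def y_overlap_ratio(bbox1, bbox2):
    """Vertical overlap of two boxes, relative to the shorter box's height."""
    _, top1, _, bottom1 = bbox1
    _, top2, _, bottom2 = bbox2
    overlap = max(0.0, min(bottom1, bottom2) - max(top1, top2))
    min_height = min(bottom1 - top1, bottom2 - top2)
    return overlap / min_height if min_height > 0 else 0.0


def merge_spans_to_line(spans, threshold=0.6):
    """Group spans into lines; a span joins the current line when its
    y-axis overlap with the line's last span exceeds `threshold`."""
    lines = []
    for span in sorted(spans, key=lambda s: s["bbox"][1]):
        if lines and y_overlap_ratio(span["bbox"], lines[-1][-1]["bbox"]) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    return lines


spans = [
    {"bbox": [0, 0, 10, 10]},
    {"bbox": [12, 1, 20, 11]},   # overlaps the first span by 90% vertically
    {"bbox": [0, 20, 10, 30]},   # clearly on a separate line
]
loose = merge_spans_to_line(spans)                   # first two spans share a line
strict = merge_spans_to_line(spans, threshold=0.95)  # every span gets its own line
```

Exposing the threshold as a parameter lets callers tighten or relax the merge for layouts with unusually tight or loose line spacing.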
-
Xiaomeng Zhao authored
fix(tools): handle empty language string in common.py
-
myhloli authored
- Check if language string is empty and set it to None
- This prevents potential errors when an empty language string is passed
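A minimal sketch of this kind of guard. The function name `normalize_lang` is hypothetical, and the actual check in common.py may be written differently.

```python
from typing import Optional


def normalize_lang(lang: Optional[str]) -> Optional[str]:
    """Map an empty language string to None so downstream code can fall
    back to its default instead of receiving an empty value."""
    if lang == "":
        return None
    return lang


empty = normalize_lang("")       # becomes None
english = normalize_lang("en")   # passed through unchanged
```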
-
Xiaomeng Zhao authored
-
- 20 Nov, 2024 1 commit
-
-
icecraft authored
-
- 19 Nov, 2024 4 commits
-
-
Xiaomeng Zhao authored
refactor: move some constants or enums defs to config folder
-
icecraft authored
-
Alex Liu authored
-
Xiaomeng Zhao authored
fix: replace old rw api with new data api
-
- 18 Nov, 2024 15 commits
-
-
Xiaomeng Zhao authored
refactor(para): adjust right margin threshold based on block width
-
myhloli authored
- Introduce a variable threshold for right margin based on block width
- Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5)
- Use 0.36 * block_weight for narrower blocks
- This change aims to improve paragraph splitting accuracy for different block widths
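The rule described in this commit can be sketched as below. The function name `right_margin_threshold` is hypothetical; `block_weight` and `block_weight_radio` are spelled as in the commit message and are assumed here to mean the block's width and its width as a fraction of the page width.

```python
def right_margin_threshold(block_weight: float, block_weight_radio: float) -> float:
    """Pick the right-margin threshold from the block's relative width.

    Wider blocks (ratio >= 0.5) get the smaller factor, 0.26;
    narrower blocks get the larger factor, 0.36.
    """
    if block_weight_radio >= 0.5:
        return 0.26 * block_weight
    return 0.36 * block_weight


wide = right_margin_threshold(block_weight=400.0, block_weight_radio=0.6)
narrow = right_margin_threshold(block_weight=100.0, block_weight_radio=0.3)
```

How the threshold is then compared against a line's right-side gap is not stated in the log, so that step is left out of the sketch.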
-
Xiaomeng Zhao authored
build(setup): add old_linux specific dependencies
-
myhloli authored
- Add albumentations package with version <=1.4.20 for old_linux
- This version is compatible with Linux systems from 2019 and earlier
- Version 1.4.21 and above introduced simsimd which is not supported on older Linux systems
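One common way to express such a pin is a setuptools extras group. The fragment below is an assumption about the shape of the change, not a copy of the project's setup.py; only the `old_linux` name and the `albumentations<=1.4.20` constraint come from the commit message.

```python
# Sketch of an extras_require mapping as it might be passed to
# setuptools.setup(); any other entries in the real file are omitted here.
extras_require = {
    "old_linux": [
        # 1.4.21 and later depend on simsimd, which the commit says
        # is not supported on older Linux systems.
        "albumentations<=1.4.20",
    ],
}
```

With a mapping like this, the extra would be selected at install time using pip's bracket syntax, for example `pip install "<package>[old_linux]"`, where `<package>` stands for the distribution name.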
-
Xiaomeng Zhao authored
refactor(para): improve paragraph splitting logic
-
myhloli authored
- Add page size information to blocks
- Calculate block width ratio relative to page width
- Adjust threshold for determining right side indentation
- Implement additional checks for merging blocks across pages
- Improve logic for identifying list structures
-
Xiaomeng Zhao authored
feat(ocr): improve handling of angled text boxes
-
myhloli authored
- Add calculate_is_angle function to detect angled text boxes
- Update update_det_boxes and merge_det_boxes functions to handle angled text boxes
- Modify angle detection logic in various parts of the code
-
icecraft authored
-
Xiaomeng Zhao authored
refactor(tests): extract common test utilities into test_commons.py
-
myhloli authored
-
Xiaomeng Zhao authored
test(unitest): Restore unit test cases
-
myhloli authored
-
Xiaomeng Zhao authored
update ci
-
quyuan authored
-