- 27 Nov, 2024 6 commits
-
-
myhloli authored
- Remove command line and API code examples from README files - Add links to online documentation for command line and API usage - Update content to point users to the new locations for detailed information
-
myhloli authored
- Remove unused function `calculate_angle_degrees` - Refactor `calculate_is_angle` to be used directly in OCR processing - Eliminate unnecessary loop index `idx` in OCR processing loops
-
myhloli authored
- Remove commented-out code in ocr_dict_merge.py - Improve imports and code organization in ocr_detect_all_bboxes.py - Delete unnecessary empty lines and improve code readability
-
myhloli authored
- Remove unused imports from commons.py - Delete unused functions related to AWS and S3 operations - Update import statements in other modules to reflect changes in commons.py - Remove redundant code and improve code readability
-
myhloli authored
-
Xiaomeng Zhao authored
fix: test_tools unittest
-
- 26 Nov, 2024 23 commits
-
-
Xiaomeng Zhao authored
perf(image_processing): reduce maximum image size for analysis
-
myhloli authored
- Decrease the maximum image size threshold from 9000 to 4500 pixels - This change aims to improve performance and reduce memory usage - Affects the custom model document analysis process
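A minimal sketch of how a size threshold like the one above might be applied when rasterizing a page. The function name `choose_render_scale`, the 200 DPI default, and the fallback to an unscaled render are assumptions for illustration; the commit message only states that the limit moved from 9000 to 4500 pixels.

```python
# Hypothetical helper: names, DPI default, and fallback behaviour are
# illustrative assumptions, not taken from the repository.
POINTS_PER_INCH = 72.0


def choose_render_scale(width_pt, height_pt, dpi=200, max_side_px=4500):
    """Return the zoom factor to use when rendering a page to an image.

    The page is normally rendered at `dpi`. If either side of the rendered
    image would exceed `max_side_px`, fall back to a scale of 1.0 so the
    image stays small enough to analyse cheaply.
    """
    scale = dpi / POINTS_PER_INCH
    if width_pt * scale > max_side_px or height_pt * scale > max_side_px:
        return 1.0
    return scale


if __name__ == "__main__":
    # US Letter page: well under the limit, rendered at full DPI.
    print(choose_render_scale(612, 792))
    # Oversized page: exceeds 4500 px at 200 DPI, so it is not upscaled.
    print(choose_render_scale(2000, 3000))
    # The same page would still have been upscaled under the old 9000 px limit.
    print(choose_render_scale(2000, 3000, max_side_px=9000))
```

Lowering the limit means more large pages take the unscaled path, which is where the memory and speed savings would come from under these assumptions.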
-
Xiaomeng Zhao authored
fix: test_rag
-
icecraft authored
-
icecraft authored
-
Xiaomeng Zhao authored
refactor: remove deprecated markdown_utils function
-
myhloli authored
-
Xiaomeng Zhao authored
test: Shield some failed test cases
-
myhloli authored
-
Xiaomeng Zhao authored
refactor(pre_proc): remove unused functions and simplify code
-
myhloli authored
- Remove unused imports and functions across multiple files - Simplify code by deleting unnecessary comments and empty lines - Update function signatures to match actual usage - Replace redundant code with more efficient alternatives
-
Xiaomeng Zhao authored
refactor(magic_pdf): remove unused functions and simplify code
-
myhloli authored
-
Xiaomeng Zhao authored
refactor(magic_pdf): remove unused functions and simplify code
-
myhloli authored
-
Xiaomeng Zhao authored
feat(pdf_parse): improve text extraction for vertical spans
-
myhloli authored
- Calculate median span height to identify vertical spans - Use PyMuPDF's 'dict' output to fill vertical spans with lines
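The median-height idea above can be sketched as follows. The span layout (`bbox` as `[x0, y0, x1, y1]`), the function name `split_vertical_spans`, and the ratio of 3 are assumptions for illustration; the commit message does not give the exact rule.

```python
from statistics import median


def split_vertical_spans(spans, ratio=3.0):
    """Split spans into (horizontal, vertical) using the median span height.

    A span is treated as vertical when it is much taller than the typical
    span on the page and also much taller than it is wide. `ratio` is a
    hypothetical tuning value.
    """
    if not spans:
        return [], []

    heights = [s["bbox"][3] - s["bbox"][1] for s in spans]
    median_height = median(heights)

    horizontal, vertical = [], []
    for span, height in zip(spans, heights):
        width = span["bbox"][2] - span["bbox"][0]
        if height > median_height * ratio and height > width * ratio:
            vertical.append(span)
        else:
            horizontal.append(span)
    return horizontal, vertical


if __name__ == "__main__":
    spans = [
        {"bbox": [10, 10, 110, 20]},   # ordinary text line
        {"bbox": [10, 30, 110, 40]},   # ordinary text line
        {"bbox": [10, 50, 110, 60]},   # ordinary text line
        {"bbox": [200, 10, 212, 130]}, # tall, narrow column of text
    ]
    horizontal, vertical = split_vertical_spans(spans)
    print(len(horizontal), len(vertical))
```

Using the median rather than the mean keeps a few very tall spans from shifting the baseline they are compared against.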
-
Xiaomeng Zhao authored
test: comment out assertion in test_metascan_classify
-
myhloli authored
- Disable the assertion for bool_classify_by_text_layout to skip this test
-
Xiaomeng Zhao authored
feat(pdf_parse): add OCR score to span data
-
myhloli authored
- Add OCR score to span dictionary when OCR text is applied - Improve data integrity by including confidence score
-
Xiaomeng Zhao authored
feat(ocr): filter out low confidence ocr results
-
myhloli authored
- Add confidence score threshold to filter out low confidence OCR results - Improve OCR accuracy by ignoring less certain detections
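A rough sketch of confidence filtering combined with storing the score on each span, as the two commits above describe. The function name `apply_ocr_results`, the `content` and `score` keys, and the 0.5 threshold are assumptions for illustration, since the commit messages do not state the actual values.

```python
def apply_ocr_results(spans, ocr_results, min_score=0.5):
    """Attach OCR text and score to spans, dropping low-confidence results.

    `spans` is a list of dicts and `ocr_results` is a parallel list of
    (text, score) tuples. Both shapes are hypothetical. Spans whose OCR
    score falls below `min_score` are omitted from the returned list.
    """
    kept = []
    for span, (text, score) in zip(spans, ocr_results):
        if score < min_score:
            continue  # ignore uncertain detections
        updated = dict(span)       # copy so the input is left unmodified
        updated["content"] = text
        updated["score"] = score   # keep the confidence for later stages
        kept.append(updated)
    return kept


if __name__ == "__main__":
    spans = [{"bbox": [0, 0, 50, 10]}, {"bbox": [0, 20, 50, 30]}]
    results = [("Invoice", 0.97), ("l|1I", 0.21)]
    for span in apply_ocr_results(spans, results):
        print(span)
```

Keeping the score on the span lets downstream steps apply their own, stricter cut-offs without re-running OCR.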
-
- 25 Nov, 2024 11 commits
-
-
Xiaomeng Zhao authored
refactor(txt_spans_extract_v2): optimize span processing and OCR logic
-
myhloli authored
- Add checks for uppercase character start in the first span of a block
-
myhloli authored
- Optimize character sorting for accurate text assembly - Handle empty char scenarios to prevent errors - Remove unnecessary comments and improve code readability - Enhance OCR text content handling by removing low-confidence spans
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Merge useful_spans and unuseful_spans handling - Simplify overlap ratio calculation and block type checking - Remove unnecessary span removal and re-addition
-
Xiaomeng Zhao authored
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
-
myhloli authored
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
-
Xiaomeng Zhao authored
master -> dev
-
myhloli authored
-
Xiaomeng Zhao authored
Release 0.10.1
-