- 27 Nov, 2024 5 commits
-
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops
-
myhloli authored
- Remove commented-out code in ocr_dict_merge.py
- Improve imports and code organization in ocr_detect_all_bboxes.py
- Delete unnecessary empty lines and improve code readability
-
myhloli authored
- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability
-
- 26 Nov, 2024 8 commits
-
-
myhloli authored
- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process
-
myhloli authored
-
myhloli authored
- Remove unused imports and functions across multiple files
- Simplify code by deleting unnecessary comments and empty lines
- Update function signatures to match actual usage
- Replace redundant code with more efficient alternatives
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Calculate median span height to identify vertical spans
- Use PyMuPDF's 'dict' output to fill vertical spans with lines
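The median-height check in this commit could look roughly like the sketch below. The commit message does not give the actual rule, so the function name `find_vertical_spans`, the `height_factor` of 2.0, the "taller than wide" test, and the `{"bbox": [x0, y0, x1, y1]}` span shape are all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def find_vertical_spans(spans: List[Dict], height_factor: float = 2.0) -> List[Dict]:
    """Return spans that look vertical relative to the page's typical span height.

    Assumed rule (not taken from the project): a span is vertical when it is
    more than `height_factor` times the median span height and taller than wide.
    """
    if not spans:
        return []
    # bbox layout assumed to be [x0, y0, x1, y1]
    heights = [s["bbox"][3] - s["bbox"][1] for s in spans]
    median_height = median(heights)
    vertical = []
    for span, height in zip(spans, heights):
        width = span["bbox"][2] - span["bbox"][0]
        if height > height_factor * median_height and height > width:
            vertical.append(span)
    return vertical
```

Using the median rather than the mean keeps one very tall span from shifting the reference height it is compared against.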
-
myhloli authored
- Add OCR score to span dictionary when OCR text is applied
- Improve data integrity by including confidence score
-
myhloli authored
- Add confidence score threshold to filter out low confidence OCR results
- Improve OCR accuracy by ignoring less certain detections
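A minimal sketch of the confidence filter this commit describes. It assumes OCR results arrive as `(text, score)` pairs and uses a hypothetical threshold of 0.5; the commit message confirms neither the data shape nor the value, and `filter_low_confidence` is an illustrative name.

```python
from typing import List, Tuple

# Hypothetical value; the commit message does not state the real threshold.
OCR_SCORE_THRESHOLD = 0.5


def filter_low_confidence(
    results: List[Tuple[str, float]],
    threshold: float = OCR_SCORE_THRESHOLD,
) -> List[Tuple[str, float]]:
    """Keep only OCR results whose confidence score reaches the threshold."""
    return [(text, score) for text, score in results if score >= threshold]
```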
-
- 25 Nov, 2024 7 commits
-
-
myhloli authored
- Add checks for uppercase character start in the first span of a block
-
myhloli authored
- Optimize character sorting for accurate text assembly
- Handle empty char scenarios to prevent errors
- Remove unnecessary comments and improve code readability
- Enhance OCR text content handling by removing low-confidence spans
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Merge useful_spans and unuseful_spans handling
- Simplify overlap ratio calculation and block type checking
- Remove unnecessary span removal and re-addition
-
myhloli authored
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
-
myhloli authored
-
- 24 Nov, 2024 2 commits
- 22 Nov, 2024 5 commits
-
-
myhloli authored
-
myhloli authored
- Add null check for OCR results to prevent errors on empty lists
- Enhance robustness of OCR text processing in the magic-pdf project
-
myhloli authored
- Move page total time logging to doc_analyze_by_custom_model.py
- Remove page total time logging from pdf_extract_kit.py
- Add page_start timing variable to custom model analysis
- Update logger output format for page total time
-
myhloli authored
- Add a null check for OCR result in the predict method
- Return None values if OCR result is None to prevent further processing
-
myhloli authored
- Add '-' and '–' to LINE_STOP_FLAG in pdf_parse_union_core_v2.py
- Remove unused debug_mode parameter from para_split function in para_split_v3.py
-
- 21 Nov, 2024 7 commits
-
-
myhloli authored
- Add an additional condition to the line stop flag check
- Ensure character is to the right of the span's left boundary
- This change helps reduce false positives in line stop detection
-
myhloli authored
- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal
-
myhloli authored
- Update OCR utils to handle different box formats and improve angle calculation
- Modify PDF extraction kit to support OCR option and optimize processing flow
- Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy
-
myhloli authored
- Improve logic to skip dropped spans in overlap detection
- Enhance efficiency by avoiding unnecessary comparisons
-
myhloli authored
- fix the bug where hyphens in the middle of a line are being discarded
-
myhloli authored
- Add threshold parameter to merge_spans_to_line function
- Make threshold configurable for y-axis overlap check
- Improve flexibility and accuracy of line merging algorithm
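To illustrate what a configurable y-axis overlap threshold does when spans are grouped into lines, here is a sketch under stated assumptions. It is not the project's `merge_spans_to_line`: the helper names, the default of 0.6, the overlap ratio (overlap height divided by the shorter span's height), and the `{"bbox": [x0, y0, x1, y1]}` span shape are all hypothetical.

```python
from typing import Dict, List, Sequence


def y_overlap_ratio(a: Sequence[float], b: Sequence[float]) -> float:
    """Vertical overlap of two boxes, relative to the shorter box's height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    if overlap <= 0 or min_height <= 0:
        return 0.0
    return overlap / min_height


def merge_spans_to_line_sketch(spans: List[Dict], threshold: float = 0.6) -> List[List[Dict]]:
    """Group spans into lines; a span joins the current line when its y-overlap
    with the most recently added span exceeds `threshold`."""
    lines: List[List[Dict]] = []
    # Visit spans top to bottom so each one is compared with its nearest neighbour.
    for span in sorted(spans, key=lambda s: s["bbox"][1]):
        if lines and y_overlap_ratio(lines[-1][-1]["bbox"], span["bbox"]) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, restore left-to-right reading order.
    for line in lines:
        line.sort(key=lambda s: s["bbox"][0])
    return lines
```

Raising the threshold splits slightly offset spans (for example superscripts) into separate lines, while lowering it merges them, which is presumably why the commit exposes it as a parameter.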
-
myhloli authored
- Check if language string is empty and set it to None
- This prevents potential errors when an empty language string is passed
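The guard described in this commit can be sketched as follows. The function name `normalize_lang` is hypothetical, and the commit message does not show where the check actually lives.

```python
from typing import Optional


def normalize_lang(lang: Optional[str]) -> Optional[str]:
    """Map an empty language string to None; pass every other value through."""
    if lang == "":
        return None
    return lang
```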
-
- 20 Nov, 2024 1 commit
-
-
icecraft authored
-
- 19 Nov, 2024 2 commits
- 18 Nov, 2024 3 commits
-
-
myhloli authored
- Introduce a variable threshold for right margin based on block width
- Use 0.26 * block_weight for wider blocks (block_weight_radio >= 0.5)
- Use 0.36 * block_weight for narrower blocks
- This change aims to improve paragraph splitting accuracy for different block widths
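The rule in this commit can be written as a small function. The names `block_weight` and `block_weight_radio` are kept as they appear in the commit message (they presumably denote the block's width and its ratio to the page width); the function name `right_margin_threshold` is hypothetical.

```python
def right_margin_threshold(block_weight: float, block_weight_radio: float) -> float:
    """Right-margin threshold as described in the commit message.

    Wider blocks (ratio >= 0.5) use 0.26 * block_weight,
    narrower blocks use 0.36 * block_weight.
    """
    factor = 0.26 if block_weight_radio >= 0.5 else 0.36
    return factor * block_weight
```

For example, a block with `block_weight` 400 and ratio 0.8 gets a threshold of about 104, while a block with `block_weight` 100 and ratio 0.2 gets about 36.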
-
myhloli authored
- Add page size information to blocks
- Calculate block width ratio relative to page width
- Adjust threshold for determining right side indentation
- Implement additional checks for merging blocks across pages
- Improve logic for identifying list structures
-
myhloli authored
- Add calculate_is_angle function to detect angled text boxes
- Update update_det_boxes and merge_det_boxes functions to handle angled text boxes
- Modify angle detection logic in various parts of the code
-