- 03 Dec, 2024 2 commits
-
-
myhloli authored
- Update VRAM checking logic in app.py and model_utils.py - Add None and type checks for VRAM values - Adjust concurrency limit calculation in app.py - Modify clean_vram function to handle cases with no VRAM information
-
myhloli authored
- Add get_concurrency_limit function to calculate concurrency limit based on VRAM - Update clean_vram function and rename to get_vram for better clarity - Apply concurrency limit to the to_markdown function in the Gradio app
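A minimal sketch of how a VRAM-based concurrency limit could be computed, including the None and type checks mentioned in the later commit. The 8 GB-per-worker figure, the parameter names, and the fallback value of 1 are assumptions for illustration, not values taken from app.py or model_utils.py.

```python
import math
from typing import Optional


def get_concurrency_limit(vram_gb: Optional[float], gb_per_worker: float = 8.0) -> int:
    """Return a worker count derived from available VRAM (hypothetical rule)."""
    # No VRAM information, or a value of the wrong type: fall back to one worker.
    if vram_gb is None or isinstance(vram_gb, bool):
        return 1
    if not isinstance(vram_gb, (int, float)):
        return 1
    # Guard against NaN, infinity, and non-positive readings.
    if not math.isfinite(vram_gb) or vram_gb <= 0:
        return 1
    # One worker per `gb_per_worker` GB, but never fewer than one.
    return max(1, int(vram_gb // gb_per_worker))
```

In a Gradio app, a value like this would typically be passed as the `concurrency_limit` argument of the event listener that wraps `to_markdown`.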
-
- 02 Dec, 2024 3 commits
-
-
myhloli authored
-
myhloli authored
- Decrease the maximum width and height from 9000 to 4500 pixels - This change aims to prevent excessive resource usage when rendering PDFs
-
myhloli authored
- Updated cut_image.py to check for NoneType imageWriter - Prevents AttributeError when imageWriter is not provided
-
- 30 Nov, 2024 1 commit
-
-
myhloli authored
- Decrease the line height multiplier from 0.8 to 0.7 for both left and right sides - This modification aims to improve the accuracy of paragraph splitting
-
- 29 Nov, 2024 8 commits
-
-
myhloli authored
-
myhloli authored
- Remove overlap between bboxes for block separation - Sort bboxes by combined x and y coordinates for better layout handling - Comment out previous overlap removal function
-
myhloli authored
- Extract language detection to block level instead of line level - Improve logic for handling Chinese, Japanese, and Korean languages - Refactor code for better readability and performance - Optimize handling of hyphenated words at line ends
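The block-level idea can be sketched as follows: the language is decided once per block, then every line in the block is joined with the same rule. The function name, the language codes, and the naive dehyphenation rule are assumptions for illustration; the actual logic in the repository may differ, and this simple rule would also merge genuine compounds such as "well-" / "known".

```python
CJK_LANGS = {"zh", "ja", "ko"}  # assumed language codes


def join_block_lines(lines, lang):
    """Join the lines of one text block using a per-block language rule."""
    texts = [line.strip() for line in lines if line.strip()]
    if lang in CJK_LANGS:
        # CJK text has no inter-word spaces, so lines are concatenated directly.
        return "".join(texts)
    parts = []
    last = len(texts) - 1
    for i, text in enumerate(texts):
        if i < last and len(text) > 1 and text.endswith("-"):
            # A trailing hyphen is treated as a word split across lines.
            parts.append(text[:-1])
        elif i < last:
            parts.append(text + " ")
        else:
            parts.append(text)
    return "".join(parts)
```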
-
myhloli authored
- Introduce language detection to determine line spacing based on language context - Implement different spacing rules for Chinese/Japanese/Korean and Western texts - Adjust span content handling based on detected language and span type
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Introduce `span_height_radio` parameter to calculate_char_in_span function - Replace fixed ratio with dynamic ratio for character and span axis alignment - Improve flexibility and accuracy of character placement within spans
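One way to read this change: a character belongs to a span when its centre lies inside the span box and its vertical centre is within a configurable fraction of the span height from the span's centre line. The sketch below illustrates that reading only. The function name `char_in_span`, the parameter name `span_height_ratio`, and the default of 0.33 are hypothetical and not taken from the repository.

```python
def char_in_span(char_bbox, span_bbox, span_height_ratio=0.33):
    """Return True if a character box is vertically aligned with a span box.

    Boxes are (x0, y0, x1, y1) tuples.
    """
    char_cx = (char_bbox[0] + char_bbox[2]) / 2
    char_cy = (char_bbox[1] + char_bbox[3]) / 2
    span_cy = (span_bbox[1] + span_bbox[3]) / 2
    span_height = span_bbox[3] - span_bbox[1]
    # The character centre must fall inside the span box.
    inside = (span_bbox[0] < char_cx < span_bbox[2]
              and span_bbox[1] < char_cy < span_bbox[3])
    # The allowed vertical offset scales with the span height.
    aligned = abs(char_cy - span_cy) < span_height * span_height_ratio
    return inside and aligned
```

A larger ratio tolerates characters that sit further from the centre line, such as superscripts, at the cost of more false matches between adjacent lines.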
-
myhloli authored
- Add empty paragraph handling for pages with no content - Append an empty markdown object when a page has no paragraphs - Increment page number even if no content is present
-
- 28 Nov, 2024 7 commits
-
-
myhloli authored
- Add LINE_START_FLAG tuple to identify starting flags of a line - Modify calculate_char_in_span function to handle both line start and stop flags - Remove redundant char_is_line_stop_flag variable and simplify logic - Improve line flag detection to enhance text extraction accuracy
-
myhloli authored
- Replace pdfminer with PyMuPDF for character detection - Implement new method detect_invalid_chars_by_pymupdf - Update check_invalid_chars in pdf_meta_scan.py to use new method - Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters - Remove unused imports and update requirements.txt
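As an illustration of the detection step, the sketch below measures how much of an already extracted string consists of the Unicode replacement character U+FFFD. It assumes that undecodable glyphs surface as U+FFFD in the extracted text (for example from PyMuPDF's `page.get_text()`); the function names and the 1% threshold are assumptions, not the repository's values.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when a glyph cannot be decoded


def invalid_char_ratio(text: str) -> float:
    """Fraction of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def has_too_many_invalid_chars(text: str, max_ratio: float = 0.01) -> bool:
    """Flag text whose share of U+FFFD exceeds an assumed threshold."""
    return invalid_char_ratio(text) > max_ratio
```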
-
myhloli authored
- Remove unused language detection code - Simplify text content processing logic - Update span sorting and text extraction in pdf_parse_union_core_v2.py
-
myhloli authored
- Add direction filtering to ignore highly skewed text lines - Improve text extraction accuracy by focusing on non-skewed content
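A possible shape for such a filter, assuming each line carries a `dir` entry holding the (cosine, sine) of its writing direction, as in PyMuPDF's `page.get_text("dict")` output. The 2-degree tolerance and the decision to accept both horizontal and vertical lines are assumptions for this sketch.

```python
import math


def is_axis_aligned(direction, tol_deg=2.0):
    """Return True if a (cos, sin) direction is within tol_deg of an axis."""
    cos_v, sin_v = direction
    # Fold the angle into [0, 90) so all four axis directions map near 0 or 90.
    angle = math.degrees(math.atan2(sin_v, cos_v)) % 90
    return min(angle, 90 - angle) <= tol_deg


def drop_skewed_lines(lines, tol_deg=2.0):
    """Keep only the line dicts whose 'dir' is close to horizontal or vertical."""
    return [ln for ln in lines if is_axis_aligned(ln["dir"], tol_deg)]
```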
-
myhloli authored
- Add language detection for each block of text - Implement language-specific logic for right margin alignment - Introduce logging for debugging purposes
-
myhloli authored
fix(Hybrid OCR): Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text
-
myhloli authored
-
- 27 Nov, 2024 5 commits
-
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Remove unused function `calculate_angle_degrees` - Refactor `calculate_is_angle` so it is used directly in OCR processing - Eliminate unnecessary loop index `idx` in OCR processing loops
-
myhloli authored
- Remove commented-out code in ocr_dict_merge.py - Improve imports and code organization in ocr_detect_all_bboxes.py - Delete unnecessary empty lines and improve code readability
-
myhloli authored
- Remove unused imports from commons.py - Delete unused functions related to AWS and S3 operations - Update import statements in other modules to reflect changes in commons.py - Remove redundant code and improve code readability
-
- 26 Nov, 2024 8 commits
-
-
myhloli authored
- Decrease the maximum image size threshold from 9000 to 4500 pixels - This change aims to improve performance and reduce memory usage - Affects the custom model document analysis process
-
myhloli authored
-
myhloli authored
- Remove unused imports and functions across multiple files - Simplify code by deleting unnecessary comments and empty lines - Update function signatures to match actual usage - Replace redundant code with more efficient alternatives
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Calculate median span height to identify vertical spans - Use PyMuPDF's 'dict' output to fill vertical spans with lines
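The median-based test could look something like the sketch below: a span is treated as vertical when it is much taller than the typical span on the page and taller than it is wide. The factor of 2, the `bbox` key, and the function name are assumptions; the commit does not state the exact criterion.

```python
from statistics import median


def find_vertical_spans(spans, ratio=2.0):
    """Return spans that look vertical relative to the median span height.

    Each span is a dict with a 'bbox' of (x0, y0, x1, y1).
    """
    heights = [s["bbox"][3] - s["bbox"][1] for s in spans]
    if not heights:
        return []
    median_height = median(heights)
    vertical = []
    for span in spans:
        x0, y0, x1, y1 = span["bbox"]
        height, width = y1 - y0, x1 - x0
        # Much taller than a normal line and taller than wide.
        if height > ratio * median_height and height > width:
            vertical.append(span)
    return vertical
```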
-
myhloli authored
- Add OCR score to span dictionary when OCR text is applied - Improve data integrity by including confidence score
-
myhloli authored
- Add confidence score threshold to filter out low confidence OCR results - Improve OCR accuracy by ignoring less certain detections
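A minimal sketch of threshold filtering, assuming each OCR result is a dict with a `score` key holding the confidence. The key name, the function name, and the 0.5 default are assumptions for illustration.

```python
def filter_ocr_results(results, min_score=0.5):
    """Drop OCR results whose confidence is below min_score."""
    # Results without a score are treated as zero confidence and removed.
    return [r for r in results if r.get("score", 0.0) >= min_score]
```

Raising the threshold trades recall for precision: fewer spurious detections survive, but faint or degraded text is more likely to be dropped.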
-
- 25 Nov, 2024 6 commits
-
-
myhloli authored
- Add checks for uppercase character start in the first span of a block
-
myhloli authored
- Optimize character sorting for accurate text assembly - Handle empty char scenarios to prevent errors - Remove unnecessary comments and improve code readability - Enhance OCR text content handling by removing low-confidence spans
-
myhloli authored
-
myhloli authored
-
myhloli authored
- Merge useful_spans and unuseful_spans handling - Simplify overlap ratio calculation and block type checking - Remove unnecessary span removal and re-addition
-
myhloli authored
fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block.
-