- 30 Nov, 2024 1 commit
-
-
houlinfeng authored
-
- 29 Nov, 2024 14 commits
-
-
Xiaomeng Zhao authored
fix(mkcontent): optimize paragraph text merging and language detection
-
myhloli authored
- Remove overlap between bboxes for block separation
- Sort bboxes by combined x and y coordinates for better layout handling
- Comment out previous overlap removal function
-
myhloli authored
- Extract language detection to block level instead of line level
- Improve logic for handling Chinese, Japanese, and Korean languages
- Refactor code for better readability and performance
- Optimize handling of hyphenated words at line ends
-
myhloli authored
-
myhloli authored
- Introduce language detection to determine line spacing based on language context
- Implement different spacing rules for Chinese/Japanese/Korean and Western texts
- Adjust span content handling based on detected language and span type
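The language-dependent spacing and end-of-line hyphen handling described in these commits might look roughly like the sketch below. It is an illustration only, not the project's code: `is_cjk`, `merge_lines`, the Unicode ranges, and the 30% threshold are all assumptions made for the example.

```python
def is_cjk(ch: str) -> bool:
    """Rough check for CJK ideographs, kana, and Hangul (assumed ranges)."""
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= code <= 0xD7AF   # Hangul syllables
    )


def merge_lines(lines):
    """Join the lines of one text block with language-dependent spacing."""
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return ""
    text = "".join(lines)
    # Block-level decision (hypothetical threshold): treat the block as CJK
    # when at least 30% of its characters fall in the ranges above.
    cjk_block = sum(is_cjk(c) for c in text) / len(text) >= 0.3

    merged = lines[0]
    for nxt in lines[1:]:
        if cjk_block:
            # CJK text is written without inter-word spaces.
            merged += nxt
        elif (
            merged.endswith("-")
            and len(merged) > 1
            and merged[-2].isalpha()
            and nxt[0].islower()
        ):
            # A word hyphenated across the line break: drop the hyphen and rejoin.
            merged = merged[:-1] + nxt
        else:
            # Western text: separate the lines with a single space.
            merged += " " + nxt
    return merged
```

Deciding the language once per block, rather than per line, keeps a short line of digits or Latin characters inside a CJK paragraph from being spaced as Western text.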
-
Xiaomeng Zhao authored
master->dev
-
myhloli authored
-
Xiaomeng Zhao authored
Release 0.10.3
-
Xiaomeng Zhao authored
refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment
-
myhloli authored
-
Xiaomeng Zhao authored
refactor(pdf_parse): adjust character-axis alignment algorithm
-
myhloli authored
- Introduce `span_height_radio` parameter to calculate_char_in_span function
- Replace fixed ratio with dynamic ratio for character and span axis alignment
- Improve flexibility and accuracy of character placement within spans
-
Xiaomeng Zhao authored
fix(ocr_mkcontent): handle empty paragraphs on pages
-
myhloli authored
- Add empty paragraph handling for pages with no content
- Append an empty markdown object when a page has no paragraphs
- Increment page number even if no content is present
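A minimal sketch of the behavior this commit describes, assuming a simplified data model: `pages_to_markdown` and the `page_no` / `md_content` keys are hypothetical names chosen for illustration, not the project's actual structures.

```python
def pages_to_markdown(pages):
    """Build one markdown record per page.

    `pages` is a list of pages, each a list of paragraph strings
    (an assumed, simplified input format).
    """
    output = []
    page_no = 0
    for paragraphs in pages:
        if not paragraphs:
            # Page without content: still emit an (empty) record so that
            # downstream page indices stay aligned with the source PDF.
            output.append({"page_no": page_no, "md_content": ""})
        else:
            output.append(
                {"page_no": page_no, "md_content": "\n\n".join(paragraphs)}
            )
        # The page counter advances whether or not the page had content.
        page_no += 1
    return output
```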
-
- 28 Nov, 2024 14 commits
-
-
Xiaomeng Zhao authored
feat(pdf_parse): add line start flag detection and optimize line stop flag logic
-
myhloli authored
- Add LINE_START_FLAG tuple to identify starting flags of a line
- Modify calculate_char_in_span function to handle both line start and stop flags
- Remove redundant char_is_line_stop_flag variable and simplify logic
- Improve line flag detection to enhance text extraction accuracy
-
Xiaomeng Zhao authored
refactor(pdf_check): improve character detection using PyMuPDF
-
myhloli authored
- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt
-
Xiaomeng Zhao authored
refactor(ocr): improve text processing and span handling
-
myhloli authored
- Remove unused language detection code
- Simplify text content processing logic
- Update span sorting and text extraction in pdf_parse_union_core_v2.py
-
Xiaomeng Zhao authored
feat(pdf_parse): filter out skewed text lines
-
myhloli authored
- Add direction filtering to ignore highly skewed text lines
- Improve text extraction accuracy by focusing on non-skewed content
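As a rough illustration of direction filtering: PyMuPDF's `page.get_text("dict")` output gives each line a `dir` tuple holding the cosine and sine of its writing direction. The sketch below keeps only lines whose direction is close to horizontal or vertical. `is_axis_aligned`, `drop_skewed_lines`, and the tolerance value are hypothetical, and the criterion used in the actual commit may differ.

```python
def is_axis_aligned(direction, tol=0.05):
    """Return True if a (cos, sin) direction is near horizontal or vertical.

    For an axis-aligned line one component is close to 0 and the other
    close to +/-1, so the smaller absolute component must be within `tol`.
    """
    cos_a, sin_a = direction
    return min(abs(cos_a), abs(sin_a)) <= tol


def drop_skewed_lines(lines, tol=0.05):
    """Keep only line dicts whose `dir` is approximately axis-aligned."""
    return [ln for ln in lines if is_axis_aligned(ln["dir"], tol)]
```

The function works on plain dictionaries, so it can be tried without opening a PDF, for example on a list such as `[{"dir": (1.0, 0.0)}, {"dir": (0.7071, 0.7071)}]`, where the second (45-degree) entry is dropped.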
-
Xiaomeng Zhao authored
refactor(para): improve language detection and block splitting
-
myhloli authored
- Add language detection for each block of text
- Implement language-specific logic for right margin alignment
- Introduce logging for debugging purposes
-
Xiaomeng Zhao authored
fix(Hybrid OCR): enable Hybrid OCR for empty spans that contain a certain number of placeholders but no actual text
-
myhloli authored
fix(Hybrid OCR): enable Hybrid OCR for empty spans that contain a certain number of placeholders but no actual text
-
Xiaomeng Zhao authored
fix(lite_model): adapt lite mode to the Hybrid OCR mode in version 0.10
-
myhloli authored
-
- 27 Nov, 2024 11 commits
-
-
Xiaomeng Zhao authored
master -> dev
-
myhloli authored
-
Xiaomeng Zhao authored
Release 0.10.2
-
Xiaomeng Zhao authored
refactor(pdf_parse_union_core_v2): optimize page processing time logging
-
myhloli authored
-
Xiaomeng Zhao authored
Feat/add s3 read write example
-
xu rui authored
-
Xiaomeng Zhao authored
docs(README): remove code examples and redirect to documentation
-
myhloli authored
- Remove command line and API code examples from README files
- Add links to online documentation for command line and API usage
- Update content to point users to the new locations for detailed information
-
icecraft authored
-
Xiaomeng Zhao authored
refactor(ocr): remove unused functions and optimize OCR processing loop
-