- 24 Dec, 2024 1 commit
-
-
myhloli authored
- Add LLM-aided formula and text correction functionality - Update config reader to include LLM-aided settings - Create new LLM-aided processing module - Update main processing script to incorporate LLM-aided corrections - Modify download scripts to check for new config version
-
- 13 Dec, 2024 1 commit
-
-
myhloli authored
- Move ligature replacement function to pdf_parse_union_core_v2.py - Optimize ligature replacement using a more efficient approach - Modify text extraction flags to preserve ligatures in PDF content - Remove unnecessary function from ocr_mkcontent.py
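A minimal sketch of what a single-pass ligature replacement could look like. The commit log does not show the actual table or function, so `LIGATURES` and `replace_ligatures` are hypothetical names, and using `str.translate` is an assumption about what "a more efficient approach" means here.

```python
# Hypothetical mapping; the real table used in pdf_parse_union_core_v2.py
# is not shown in the commit log.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Build the translation table once so each call is a single pass over the text.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Expand typographic ligature code points into their plain letters."""
    return text.translate(_LIGATURE_TABLE)
```

Compared with chaining one `str.replace` call per ligature, a translation table scans the string once regardless of how many entries the mapping holds.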
-
- 12 Dec, 2024 1 commit
-
-
myhloli authored
- Add initial setup for layout detection - Implement conditional cropping for tall images - Skip cropping for wide images to improve performance - Reuse Image object across layout detection steps
-
- 11 Dec, 2024 14 commits
-
-
xu rui authored
-
myhloli authored
-
myhloli authored
- Update the filename generation logic in the draw_bbox function - Remove the unnecessary '_line_sort' suffix from the output PDF filename
-
myhloli authored
- Remove unused import of ocr_model_init from magic_pdf.model.sub_modules.model_init - Keep existing functionality and structure intact
-
myhloli authored
- Implement image cropping and pasting technique to enhance layout detection - Adjust detected polygons to original image coordinates - Add comments for better code readability
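The coordinate adjustment mentioned above can be sketched as a plain translation. This assumes the crop is pasted onto a canvas without scaling; `to_original_coords` and its parameters are hypothetical and are not taken from the repository.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_original_coords(
    polygon: List[Point],
    crop_left: float,
    crop_top: float,
    paste_x: float = 0,
    paste_y: float = 0,
) -> List[Point]:
    """Map polygon points detected on the pasted canvas back to the original image.

    Assumes the region starting at (crop_left, crop_top) in the original image
    was pasted at (paste_x, paste_y) on the canvas with no resizing.
    """
    dx = crop_left - paste_x
    dy = crop_top - paste_y
    return [(x + dx, y + dy) for x, y in polygon]
```

If the crop were resized before detection, a scale factor would have to be applied before the offset; that case is left out of this sketch.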
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
icecraft authored
-
- 10 Dec, 2024 7 commits
-
-
myhloli authored
- Change import paths from paddleocr.ppocr to ppocr for utility functions - Update import paths for logging and utility modules in ppocr_273_mod.py - Modify import paths for tablemaster_paddle.py to use ppstructure instead of paddleocr.ppstructure
-
myhloli authored
- Replace MuPDF with pdfminer for detecting invalid characters in PDFs - Uncomment and update the detect_invalid_chars function to use pdfminer - Update the check_invalid_chars function in pdf_meta_scan.py to use the new implementation
-
myhloli authored
- Change import path for TableSystem from 'ppstructure.table.predict_table' to 'paddleocr.ppstructure.table.predict_table' - Change import path for init_args from 'ppstructure.utility' to 'paddleocr.ppstructure.utility'
-
myhloli authored
- Modify import paths for paddleocr utilities in ocr_utils.py and ppocr_273_mod.py - Change from `ppocr.utils.utility` to `paddleocr.ppocr.utils.utility` - Update related import statements in two files to reflect the new path
-
myhloli authored
- Remove commented-out call to clean_memory() function - This change simplifies the code by eliminating an unused code snippet
-
myhloli authored
- Import paddle module and disable its signal handler to prevent interference with other components - This change addresses potential conflicts between PaddlePaddle and other libraries or system signals
-
myhloli authored
- Remove the call to clean_memory() function from pdf_parse_union_core_v2.py - This change may affect memory usage and needs to be tested to ensure proper functionality
-
- 09 Dec, 2024 3 commits
- 07 Dec, 2024 2 commits
-
-
sawmice authored
-
myhloli authored
- In Chinese, Japanese, and Korean (CJK) languages, no space is needed for line breaks within paragraphs. - However, if an inline equation is at the end of a line, a space should be added to separate it from the following text. - This change improves the formatting of documents containing both CJK text and inline equations.
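The joining rule described above might look roughly like this. It is a sketch under assumptions: each line is represented as a `(text, ends_with_inline_equation)` pair, the CJK check covers only a few common Unicode ranges, and `join_lines` is a hypothetical helper rather than the project's function.

```python
import re

# A rough CJK test: kana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
_CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def join_lines(lines):
    """Join paragraph lines given as (text, ends_with_inline_equation) pairs.

    No separator is added after a line that ends in a CJK character, unless the
    line ends with an inline equation; every other line break becomes a space.
    """
    parts = []
    last_index = len(lines) - 1
    for i, (text, ends_with_equation) in enumerate(lines):
        parts.append(text)
        if i == last_index:
            break
        ends_in_cjk = bool(text) and _CJK.match(text[-1]) is not None
        if ends_with_equation or not ends_in_cjk:
            parts.append(" ")
    return "".join(parts)
```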
-
- 06 Dec, 2024 10 commits
-
-
myhloli authored
- Remove concurrency limit logic from app.py - Update model initialization process in various modules - Remove unused VRAM check for concurrency limit - Refactor OCR model initialization in pdf_extract_kit.py - Update txt_spans_extract_v2 function to use lang parameter instead of ocr_model
-
myhloli authored
- Remove usage of AtomModelSingleton for OCR model creation - Add ocr_model_init function to initialize OCR model - Update OCR model initialization in pdf_extract_kit.py and pdf_parse_union_core_v2.py - Modify txt_spans_extract_v2 function to accept ocr_model as a parameter - Update parse_page_core function to use ocr_model instead of lang for OCR processing
-
myhloli authored
- Add threading support for OCR model initialization - Modify AtomModelSingleton to handle thread-specific instances - Update PDFExtractKit and PDFParseUnionCoreV2 to use new thread-safe OCR initialization
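One way to keep thread-specific model instances is `threading.local`. The sketch below is illustrative only: `PerThreadModelCache` and its `factory` argument are hypothetical and do not reproduce how AtomModelSingleton is actually written.

```python
import threading


class PerThreadModelCache:
    """Cache model instances separately for each thread."""

    def __init__(self, factory):
        # factory(name) builds a model; a stand-in for the real initializer.
        self._factory = factory
        self._local = threading.local()

    def get(self, name):
        # Each thread sees its own `models` dict on the thread-local object.
        models = getattr(self._local, "models", None)
        if models is None:
            models = self._local.models = {}
        if name not in models:
            models[name] = self._factory(name)
        return models[name]
```

Because no instance is shared between threads, no lock is needed around inference, at the cost of holding one model copy per thread.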
-
myhloli authored
- Remove threading.Lock import and usage - Delete unused model initialization comments and code - Simplify OCR model initialization in both pdf_extract_kit.py and pdf_parse_union_core_v2.py
-
myhloli authored
- Remove usage of AtomModelSingleton for OCR model initialization - Use ocr_model_init function for creating OCR model instance - Update import statement to include ocr_model_init - Comment out old OCR model initialization code
-
myhloli authored
- Remove usage of AtomModelSingleton for OCR model initialization - Add import of ocr_model_init from model_init module - Update OCR model initialization process to use ocr_model_init function - Remove lock for OCR processing as it's no longer needed
-
myhloli authored
- Remove usage of ModelSingleton class - Initialize model directly using custom_model_init function - Add self._lock attribute to PDFExtractKit class for thread safety - Replace local lock with self._lock for OCR processing
-
myhloli authored
-
赵小蒙 authored
- Remove unnecessary threading.Lock in AtomModelSingleton - Add threading.Lock to CustomPEKModel for OCR processing - Simplify model initialization logic in AtomModelSingleton
-
myhloli authored
- Add condition to return existing model if already initialized - Improve efficiency by avoiding redundant model creation
-
- 05 Dec, 2024 1 commit
-
-
myhloli authored
- Introduce a lock to synchronize access to OCR model initialization - This change improves thread safety when multiple threads access the OCR model concurrently - The lock ensures that the OCR model is initialized only once, even in multi-threaded scenarios
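A lock-guarded one-time initialization of this kind can be sketched as follows. The names `get_ocr_model` and `_build_ocr_model` are hypothetical, and the builder is a placeholder for whatever expensive constructor the project actually calls.

```python
import threading

_ocr_model = None
_ocr_model_lock = threading.Lock()


def _build_ocr_model():
    # Hypothetical stand-in for the expensive OCR model constructor.
    return object()


def get_ocr_model():
    """Return the shared OCR model, creating it at most once across threads."""
    global _ocr_model
    if _ocr_model is None:  # fast path once the model exists
        with _ocr_model_lock:
            if _ocr_model is None:  # re-check while holding the lock
                _ocr_model = _build_ocr_model()
    return _ocr_model
```

The second check inside the lock matters: two threads can both pass the first check before either acquires the lock, and without the re-check the model would be built twice.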
-