- 21 Nov, 2024 2 commits
-
-
myhloli authored
  - Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
  - Add character filling in spans for better text reconstruction
  - Introduce empty span handling using OCR for missed text
  - Optimize span filtering and overlap removal
-
myhloli authored
  - Update OCR utils to handle different box formats and improve angle calculation
  - Modify PDF extraction kit to support OCR option and optimize processing flow
  - Enhance PPOCR model to sort and filter detection boxes, improving text splitting accuracy
-
- 19 Nov, 2024 1 commit
-
-
icecraft authored
-
- 18 Nov, 2024 2 commits
- 15 Nov, 2024 1 commit
-
-
myhloli authored
-
- 08 Nov, 2024 2 commits
-
-
myhloli authored
  - Integrate RapidOCR with RapidTable model for table recognition
  - Improve memory management for devices with <= 8GB VRAM
  - Update table recognition process to use RapidOCR for RapidTable
  - Add rapidocr-paddle dependency in setup.py
-
myhloli authored
  - Add RapidTable model support for table recognition
  - Update table model configuration and initialization
  - Modify table recognition process to use RapidTable when specified
  - Add RapidTable dependency to setup.py
-
- 07 Nov, 2024 1 commit
-
-
myhloli authored
  - Implement xycut algorithm to sort blocks when layoutreader fails
  - Add recursive_xy_cut function to perform the xycut algorithm
  - Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
  - Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
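For readers unfamiliar with the technique named in this commit, the following is a minimal sketch of a recursive XY cut over bounding boxes. It is not the repository's `recursive_xy_cut` implementation; the function names `xy_cut_order` and `_split_by_gaps`, the `(x0, y0, x1, y1)` box layout, and the tie-breaking rule are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def _split_by_gaps(indices: List[int], boxes: Sequence[Box], axis: int) -> List[List[int]]:
    """Group box indices into runs whose projections on `axis` overlap.

    axis=0 projects onto x (vertical cuts), axis=1 onto y (horizontal cuts).
    A new group starts wherever the projections leave a gap.
    """
    lo, hi = axis, axis + 2
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # furthest extent covered by the current group
    for i in ordered[1:]:
        if boxes[i][lo] >= reach:  # gap found: cut here
            groups.append([i])
            reach = boxes[i][hi]
        else:
            groups[-1].append(i)
            reach = max(reach, boxes[i][hi])
    return groups


def xy_cut_order(boxes: Sequence[Box]) -> List[int]:
    """Return box indices in reading order (hypothetical helper)."""

    def recurse(indices: List[int], axis: int) -> List[int]:
        groups = _split_by_gaps(indices, boxes, axis)
        if len(groups) == 1:
            # No cut on this axis: try the other one before giving up.
            groups = _split_by_gaps(indices, boxes, 1 - axis)
            if len(groups) == 1:
                # No cut on either axis: fall back to top-to-bottom, left-to-right.
                return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))
            next_axis = axis
        else:
            next_axis = 1 - axis
        ordered: List[int] = []
        for group in groups:
            ordered.extend(recurse(group, next_axis))
        return ordered

    if not boxes:
        return []
    return recurse(list(range(len(boxes))), 1)  # start with horizontal cuts


# A full-width title above a two-column body.
page = [
    (0, 0, 100, 10),    # 0: title
    (0, 20, 45, 40),    # 1: left column, top
    (55, 20, 100, 30),  # 2: right column, top
    (0, 50, 45, 70),    # 3: left column, bottom
    (55, 35, 100, 70),  # 4: right column, bottom
]
print(xy_cut_order(page))  # [0, 1, 3, 2, 4]
```

The sketch alternates the cut axis at each level, so the title is separated first, then the two columns, then the blocks inside each column.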
-
- 06 Nov, 2024 1 commit
-
-
myhloli authored
  - Remove unused code for copying detection and recognition models
  - Simplify OCR model initialization using atom_model_manager
  - Delete unnecessary comments and empty lines
-
- 05 Nov, 2024 1 commit
-
-
myhloli authored
  - Replace np.array with np.asarray for better performance
  - Add image color conversion from RGB to BGR using OpenCV
-
- 04 Nov, 2024 3 commits
-
-
myhloli authored
  - Import 're' module for regular expression operations
  - Implement HTML minification for 'output_format=html'
  - Add 'minify_html' method to remove unnecessary whitespace and format HTML
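As an illustration of regex-based HTML minification of the kind this commit describes, here is a minimal sketch. The commit does not show the actual `minify_html` body, so the two patterns below are assumptions, not the repository's code.

```python
import re


def minify_html(html: str) -> str:
    """Collapse whitespace in an HTML fragment (illustrative sketch only)."""
    html = re.sub(r">\s+<", "><", html)  # drop whitespace that sits only between tags
    html = re.sub(r"\s+", " ", html)     # collapse remaining whitespace runs to one space
    return html.strip()


table = "<table>\n  <tr>\n    <td> a   b </td>\n  </tr>\n</table>\n"
print(minify_html(table))  # <table><tr><td> a b </td></tr></table>
```

This approach suits generated table markup; it would also alter whitespace inside elements such as `<pre>`, where whitespace is significant.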
-
myhloli authored
  - Comment out an unused code block in the ppTableModel.py file
  - Improve code readability and maintainability by removing unnecessary code
-
myhloli authored
  - Update StructTableModel to use the latest struct-eqtable library
  - Add support for HTML table extraction in PDF Extract Kit
  - Improve error handling and model initialization
  - Update dependencies in setup.py for struct-eqtable
-
- 28 Oct, 2024 5 commits
-
-
myhloli authored
  - Remove import and usage of StructTableModel
  - Add support for TableMaster model
  - Update table model initialization logic to support TableMaster
  - Log error and exit if StructEqTable is selected, as it's under upgrade
  - Update README files to reflect changes in table parsing capabilities
-
icecraft authored
-
liukaiwen authored
-
liukaiwen authored
-
icecraft authored
-
- 25 Oct, 2024 4 commits
- 24 Oct, 2024 3 commits
- 23 Oct, 2024 1 commit
-
-
myhloli authored
  - Add new layout model option: DocLayout-YOLO
  - Implement model initialization and prediction for DocLayout-YOLO
  - Update configuration options to include new model
  - Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
  - Update Gradio app to support more custom switches
-
- 17 Oct, 2024 2 commits
-
-
liukaiwen authored
-
myhloli authored
  - Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
  - Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
  - Remove wordninja dependency from requirements
  - Update ocr_model_init to include additional parameters for OCR model configuration
-
- 14 Oct, 2024 1 commit
-
-
myhloli authored
  - Add detection for list and index blocks in OCR processing
  - Implement merging of list and index blocks across pages
  - Update block types to include list and index categories
  - Adjust text merging logic to handle new block types
  - Modify layout drawing to distinguish list and index blocks
-
- 10 Oct, 2024 1 commit
-
-
myhloli authored
  - Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
  - Add support for specifying page range in doc_analyze_by_custom_model
  - Implement garbage collection and memory cleaning after processing
  - Refine image loading from PDF, including handling out-of-range pages
-
- 08 Oct, 2024 4 commits
-
-
liukaiwen authored
-
icecraft authored
-
icecraft authored
-
myhloli authored
  - Introduce a conditional memory cleanup step in the PDF extraction process
  - Assess available GPU memory before deciding to perform memory cleanup
  - Log the time taken for garbage collection when it occurs
  - This optimization helps to balance performance and resource utilization
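The conditional cleanup described here can be sketched roughly as follows. This is an assumption-laden illustration, not the repository's code: the function name `maybe_clean_memory`, the `free_gb` argument, and the 8 GB default threshold are all hypothetical, and the caller is assumed to obtain the free-memory figure from whatever GPU framework it uses.

```python
import gc
import logging
import time

logger = logging.getLogger(__name__)


def maybe_clean_memory(free_gb: float, threshold_gb: float = 8.0) -> bool:
    """Run garbage collection only when free GPU memory is at or below a threshold.

    `free_gb` is supplied by the caller; `threshold_gb` is an assumed default.
    Returns True if a cleanup pass was performed.
    """
    if free_gb > threshold_gb:
        return False  # plenty of headroom: skip the cleanup cost
    start = time.perf_counter()
    gc.collect()  # a GPU framework's cache-release call would also go here
    logger.info("gc time: %.2f s", time.perf_counter() - start)
    return True


print(maybe_clean_memory(free_gb=16.0))  # False
print(maybe_clean_memory(free_gb=4.0))   # True
```

Skipping the collection when memory is plentiful avoids paying the garbage-collection pause on every document, which is the trade-off the commit message points to.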
-
- 06 Oct, 2024 1 commit
-
-
myhloli authored
  - Enhance timing output precision to two decimal places for better readability
  - Calculate and log document analysis speed in pages per second
  - Optimize logging for YOLO and table recognition processes
  - Remove unnecessary comments and improve code efficiency
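The speed figure mentioned above is simply page count divided by elapsed time. A small sketch, assuming a hypothetical helper `format_speed` and an illustrative message wording (the project's actual log text is not shown in this commit):

```python
def format_speed(num_pages: int, elapsed_s: float) -> str:
    """Format elapsed time and throughput with two decimal places (hypothetical helper)."""
    # Guard against a zero-length interval on very fast runs.
    speed = num_pages / elapsed_s if elapsed_s > 0 else float("inf")
    return f"doc analyze time: {elapsed_s:.2f} s, speed: {speed:.2f} pages/second"


print(format_speed(30, 12.0))  # doc analyze time: 12.00 s, speed: 2.50 pages/second
```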
-
- 30 Sep, 2024 1 commit
-
-
myhloli authored
-
- 29 Sep, 2024 1 commit
-
-
myhloli authored
Remove the clean_memory function from pdf_parse_union_core_v2.py because it was unused. This streamlines the code and avoids confusion about its purpose.
-
- 28 Sep, 2024 1 commit
-
-
myhloli authored
Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3` module. This ensures compatibility with the revised directory layout.
-
- 20 Sep, 2024 1 commit
-
-
myhloli authored
-