- 09 Jan, 2025 1 commit
-
-
myhloli authored
- Improve language detection by removing newline characters from the input text
- Add error handling and fallback mechanism to deal with text containing control characters
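The newline stripping and fallback described in this commit could look roughly like the sketch below. It is not the committed code: `detect_language_safely`, `strip_control_chars`, and the `detect` callable are hypothetical names, and the detector is passed in so the example does not depend on any particular language-detection library.

```python
import unicodedata
from typing import Callable


def strip_control_chars(text: str) -> str:
    """Drop characters in Unicode categories Cc (control) and Cs (surrogate)."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cs"))


def detect_language_safely(
    text: str,
    detect: Callable[[str], str],  # hypothetical detector: text -> language code
    default: str = "unknown",
) -> str:
    """Run `detect` on newline-free text, retrying without control characters."""
    # Step 1: remove newline characters before detection.
    cleaned = text.replace("\r", "").replace("\n", "")
    try:
        return detect(cleaned)
    except Exception:
        # Step 2 (fallback): retry once with control characters removed.
        try:
            return detect(strip_control_chars(cleaned))
        except Exception:
            # Step 3: give up and return a default instead of raising.
            return default
```

Any callable that maps a string to a language code can be plugged in as `detect`; the wrapper only guarantees that a detector failure does not propagate to the caller.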
-
- 05 Jan, 2025 1 commit
-
-
myhloli authored
- Add `draw_char_bbox` function to `draw_bbox.py` for drawing character bounding boxes
- Integrate `draw_char_bbox` into `common.py` for use in PDF processing pipeline
- Include option to draw character bounding boxes in debug mode
-
- 30 Dec, 2024 1 commit
-
-
myhloli authored
- Update `clean_memory.py` to use `torch_npu.npu` instead of `torch.npu`
- Update `model_utils.py` to use `torch_npu.npu` instead of `torch.npu`
- Simplify NPU availability check and bfloat16 support in `pdf_parse_union_core_v2.py`
-
- 26 Dec, 2024 2 commits
-
-
myhloli authored
- Update clean_memory function to support both CUDA and NPU devices
- Implement get_device function to centralize device selection logic
- Modify model initialization and memory cleaning to use the selected device
- Update RapidTableModel to support both RapidOCR and PaddleOCR engines
-
myhloli authored
- Add NPU support for memory cleaning and model initialization
- Optimize table model initialization and prediction process
- Update memory utils to support NPU
- Add language parameter for table model
-
- 24 Dec, 2024 1 commit
-
-
myhloli authored
- Add LLM-aided formula and text correction functionality
- Update config reader to include LLM-aided settings
- Create new LLM-aided processing module
- Update main processing script to incorporate LLM-aided corrections
- Modify download scripts to check for new config version
-
- 11 Dec, 2024 2 commits
- 10 Dec, 2024 1 commit
-
-
myhloli authored
- Replace MuPDF with pdfminer for detecting invalid characters in PDFs
- Uncomment and update the detect_invalid_chars function to use pdfminer
- Update the check_invalid_chars function in pdf_meta_scan.py to use the new implementation
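For context, pdfminer renders glyphs it cannot map to Unicode as `(cid:N)` markers, so one plausible way to flag a PDF with invalid characters is to measure how much of the extracted text consists of such markers. The sketch below illustrates that idea only. The function name `has_too_many_invalid_chars` and the 5% threshold are assumptions, not values taken from this commit.

```python
import re

# pdfminer emits "(cid:N)" for glyphs that have no Unicode mapping.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def has_too_many_invalid_chars(text: str, max_ratio: float = 0.05) -> bool:
    """Return True if unmapped-glyph markers exceed `max_ratio` of the text.

    `max_ratio` is a hypothetical threshold chosen for illustration.
    """
    cid_count = len(CID_PATTERN.findall(text))
    # Count each "(cid:N)" marker as a single character of the text.
    total = len(CID_PATTERN.sub("", text)) + cid_count
    if total == 0:
        return False  # empty text: nothing to judge
    return cid_count / total > max_ratio
```

The input would be text already extracted from a sample of pages; the extraction step itself is left out here.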
-
- 03 Dec, 2024 2 commits
- 02 Dec, 2024 1 commit
-
-
myhloli authored
-
- 29 Nov, 2024 2 commits
- 28 Nov, 2024 1 commit
-
-
myhloli authored
- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt
-
- 27 Nov, 2024 2 commits
- 26 Nov, 2024 3 commits
- 25 Nov, 2024 1 commit
-
-
myhloli authored
-
- 22 Nov, 2024 1 commit
-
-
myhloli authored
-
- 21 Nov, 2024 1 commit
-
-
myhloli authored
- Implement new text extraction method (txt_spans_extract_v2) to enhance accuracy
- Add character filling in spans for better text reconstruction
- Introduce empty span handling using OCR for missed text
- Optimize span filtering and overlap removal
-
- 19 Nov, 2024 1 commit
-
-
icecraft authored
-
- 18 Nov, 2024 1 commit
-
-
icecraft authored
-
- 15 Nov, 2024 1 commit
-
-
myhloli authored
-
- 08 Nov, 2024 2 commits
-
-
myhloli authored
- Change the default table model from TABLE_MASTER to RAPID_TABLE
-
myhloli authored
- Add RapidTable model support for table recognition
- Update table model configuration and initialization
- Modify table recognition process to use RapidTable when specified
- Add RapidTable dependency to setup.py
-
- 07 Nov, 2024 1 commit
-
-
myhloli authored
- Implement xycut algorithm to sort blocks when layoutreader fails
- Add recursive_xy_cut function to perform the xycut algorithm
- Update pdf_parse_union_core_v2.py to use xycut when layoutreader fails
- Modify draw_bbox.py to handle cases where layoutreader fails to sort blocks
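As a rough illustration of the XY-cut idea referenced in this commit, the sketch below recursively splits a set of bounding boxes at empty strips, first trying vertical cuts (column gaps) and then horizontal cuts (row gaps). It is a simplified sketch under stated assumptions, not the repository's `recursive_xy_cut`: the signature, the cut order, and the `_split_by_gaps` helper are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def _split_by_gaps(indices: Sequence[int], boxes: Sequence[Box], lo: int, hi: int) -> List[List[int]]:
    """Group boxes whose [lo, hi] intervals overlap along one axis."""
    ordered = sorted(indices, key=lambda i: boxes[i][lo])
    groups = [[ordered[0]]]
    reach = boxes[ordered[0]][hi]  # furthest extent covered so far
    for i in ordered[1:]:
        if boxes[i][lo] > reach:
            groups.append([i])  # empty strip before this box: cut here
        else:
            groups[-1].append(i)
        reach = max(reach, boxes[i][hi])
    return groups


def recursive_xy_cut(boxes: Sequence[Box], indices: Optional[Sequence[int]] = None) -> List[int]:
    """Return box indices in an approximate reading order."""
    if indices is None:
        indices = list(range(len(boxes)))
    if len(indices) <= 1:
        return list(indices)
    # Try a cut along x (columns) first, then along y (rows).
    for lo, hi in ((0, 2), (1, 3)):
        groups = _split_by_gaps(indices, boxes, lo, hi)
        if len(groups) > 1:
            order: List[int] = []
            for group in groups:
                order.extend(recursive_xy_cut(boxes, group))
            return order
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(indices, key=lambda i: (boxes[i][1], boxes[i][0]))


# Usage: a full-width title above two columns, given in shuffled order.
page = [
    (310, 220, 550, 500),  # 0: right column, second paragraph
    (50, 80, 290, 300),    # 1: left column, first paragraph
    (50, 20, 550, 60),     # 2: title spanning both columns
    (310, 80, 550, 200),   # 3: right column, first paragraph
    (50, 320, 290, 500),   # 4: left column, second paragraph
]
reading_order = recursive_xy_cut(page)  # [2, 1, 4, 3, 0]
```

A known limitation of this plain variant is that a horizontal gap that happens to line up across both columns is cut before the columns are separated, which interleaves them. Production implementations usually add heuristics such as minimum gap widths.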
-
- 06 Nov, 2024 2 commits
- 01 Nov, 2024 1 commit
-
-
myhloli authored
- Refactor remove_outside_spans function to filter spans more accurately
- Add image_footnote, index, and list block types to output file documentation
- Update draw_span_bbox to use preproc_blocks instead of para_blocks
- Bump version to 0.9.0
-
- 28 Oct, 2024 1 commit
-
-
liukaiwen authored
-
- 26 Oct, 2024 1 commit
-
-
myhloli authored
- Add support for drawing bounding boxes of table and image sub-blocks
- Implement sorting of table blocks based on type order
- Update bounding box drawing for text and title blocks
- Refactor code to handle different block types and their sub-blocks
-
- 24 Oct, 2024 1 commit
-
-
icecraft authored
feat: add Data api
-
- 23 Oct, 2024 1 commit
-
-
myhloli authored
- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches
-
- 17 Oct, 2024 1 commit
-
-
liukaiwen authored
-
- 14 Oct, 2024 2 commits
-
-
myhloli authored
Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.
-
myhloli authored
- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks
-
- 08 Oct, 2024 1 commit
-
-
myhloli authored
- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class
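The local-first lookup described in this commit can be sketched as a small helper that prefers an existing local directory and otherwise returns the online model identifier. The name `resolve_model_source` and both parameters are hypothetical. How the local directory is configured and which online identifier is used are not stated in the commit message.

```python
from pathlib import Path
from typing import Optional


def resolve_model_source(local_dir: Optional[str], online_id: str) -> str:
    """Prefer a local model directory; fall back to the online identifier.

    `local_dir` and `online_id` are hypothetical parameters for illustration.
    """
    if local_dir and Path(local_dir).is_dir():
        # A local copy exists: load the model from disk.
        return str(Path(local_dir))
    # No usable local directory: let the loader fetch the online model.
    return online_id
```

The returned string can then be handed to whatever loader initializes the model, so the rest of the initialization code does not need to know which source was chosen.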
-