refactor(pdf_check): improve character detection using PyMuPDF
- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt
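
The bodies of detect_invalid_chars_by_pymupdf and __replace_0xfffd are not shown in this commit view, so the following is only a rough sketch of how such a check might work, not the project's actual code. It assumes that "invalid" characters show up as the Unicode replacement character U+FFFD in the text PyMuPDF extracts. The function names, the 5% threshold, and the 10-page sample size are all hypothetical choices made for illustration.

```python
from itertools import islice

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when a glyph cannot be decoded


def invalid_char_ratio(text: str) -> float:
    """Share of non-whitespace characters that are U+FFFD."""
    stripped = "".join(text.split())  # drop all whitespace before counting
    if not stripped:
        return 0.0
    return stripped.count(REPLACEMENT_CHAR) / len(stripped)


def has_invalid_chars(text: str, threshold: float = 0.05) -> bool:
    """True when the U+FFFD share exceeds the (assumed) threshold."""
    return invalid_char_ratio(text) > threshold


def replace_0xfffd(text: str) -> str:
    """Swap each U+FFFD for a space so downstream code sees plain text."""
    return text.replace(REPLACEMENT_CHAR, " ") if text else text


def pdf_has_invalid_chars(pdf_bytes: bytes, threshold: float = 0.05,
                          max_pages: int = 10) -> bool:
    """Check the first max_pages pages of a PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        text = "".join(page.get_text() for page in islice(doc, max_pages))
    return has_invalid_chars(text, threshold)
```

Keeping the ratio calculation separate from the PDF access means the threshold logic can be exercised on plain strings, with PyMuPDF needed only for the final step.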
@@ -4,10 +4,10 @@ click>=8.1.7
 fast-langdetect==0.2.0
 loguru>=0.6.0
 numpy>=1.21.6,<2.0.0
-pdfminer.six==20231228
 pydantic>=2.7.2,<2.8.0
 PyMuPDF>=1.24.9
 scikit-learn>=1.0.2
 torch>=2.2.2,<=2.3.1
 transformers
+# pdfminer.six==20231228
 # The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.