magic_pdf/libs/pdf_check.py · ac88815620933dba8d657f68b59b07199509dcd3 · wangsen / MinerU

refactor(pdf_check): improve character detection using PyMuPDF · ac888156

myhloli authored Nov 28, 2024

- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt
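
The character check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper names (`invalid_char_ratio`, `text_looks_valid`, `replace_invalid_chars`, `extract_text_with_pymupdf`) and the 1% threshold are assumptions chosen for the example. It assumes that undecodable glyphs appear in the extracted text as U+FFFD (the Unicode replacement character), which is what the `__replace_0xfffd` name suggests.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character (0xFFFD)


def invalid_char_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return visible.count(REPLACEMENT_CHAR) / len(visible)


def text_looks_valid(text: str, max_ratio: float = 0.01) -> bool:
    """Hypothetical check: accept text whose U+FFFD share is at most max_ratio.

    The 0.01 default is an assumed value for illustration only.
    """
    return invalid_char_ratio(text) <= max_ratio


def replace_invalid_chars(text: str, substitute: str = " ") -> str:
    """Replace every U+FFFD with a substitute (a space by default)."""
    return text.replace(REPLACEMENT_CHAR, substitute)


def extract_text_with_pymupdf(pdf_bytes: bytes) -> str:
    """Extract plain text from all pages of an in-memory PDF with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers above need no dependency

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "".join(page.get_text("text") for page in doc)
```

A caller would typically pass the result of `extract_text_with_pymupdf` to `text_looks_valid` and, if the text is kept, clean it with `replace_invalid_chars`. Ignoring whitespace in the ratio keeps sparse pages from looking cleaner than they are.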

pdf_check.py 3.1 KB