feat(pdf_parse): filter out skewed text lines

- Add direction filtering to ignore highly skewed text lines
- Improve text extraction accuracy by focusing on non-skewed content

37da8c44 · myhloli · f674b8d4 · 37da8c44
Commit 37da8c44 authored Nov 28, 2024 by myhloli

Showing with 3 additions and 1 deletion

magic_pdf/pdf_parse_union_core_v2.py +3 -1

--- a/magic_pdf/pdf_parse_union_core_v2.py
+++ b/magic_pdf/pdf_parse_union_core_v2.py
@@ -139,10 +139,12 @@ def txt_spans_extract_v2(pdf_page, spans, all_bboxes, all_discarded_blocks, lang
    text_blocks_raw = pdf_page.get_text('rawdict', flags=fitz.TEXTFLAGS_TEXT)['blocks']
-    # @todo: after getting the chars, first remove those with a large skew angle
    all_pymu_chars = []
    for block in text_blocks_raw:
        for line in block['lines']:
+            cosine, sine = line['dir']
+            if abs(cosine) < 0.9 or abs(sine) > 0.1:
+                continue
            for span in line['spans']:
                all_pymu_chars.extend(span['chars'])
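
To see which lines the new check drops, here is a minimal standalone sketch of the same test. It assumes, as in PyMuPDF's `rawdict` output, that `line['dir']` is a unit vector `(cosine, sine)` of the writing direction. The helpers `is_skewed`, `collect_chars`, and `direction`, and the sample `blocks` data, are hypothetical names made up for illustration and are not part of `magic_pdf`.

```python
import math


def is_skewed(direction_vec, min_abs_cos=0.9, max_abs_sin=0.1):
    """Return True when a line's direction is too far from horizontal.

    Same condition as in the diff; the thresholds are exposed as
    parameters only for this illustration.
    """
    cosine, sine = direction_vec
    return abs(cosine) < min_abs_cos or abs(sine) > max_abs_sin


def collect_chars(text_blocks_raw):
    """Gather chars from all lines that pass the skew filter."""
    chars = []
    for block in text_blocks_raw:
        for line in block['lines']:
            if is_skewed(line['dir']):
                continue  # skip rotated or strongly skewed lines
            for span in line['spans']:
                chars.extend(span['chars'])
    return chars


def direction(angle_deg):
    """Hypothetical helper: unit direction vector for a given angle."""
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))


# Synthetic stand-in for pdf_page.get_text('rawdict')['blocks'].
blocks = [{'lines': [
    {'dir': direction(0), 'spans': [{'chars': [{'c': 'a'}]}]},
    {'dir': direction(3), 'spans': [{'chars': [{'c': 'b'}]}]},
    {'dir': direction(10), 'spans': [{'chars': [{'c': 'c'}]}]},
    {'dir': direction(90), 'spans': [{'chars': [{'c': 'd'}]}]},
    {'dir': direction(180), 'spans': [{'chars': [{'c': 'e'}]}]},
]}]

kept = ''.join(ch['c'] for ch in collect_chars(blocks))
print(kept)  # abe
```

For a unit vector, `abs(sine) > 0.1` is the stricter of the two conditions: it rejects any line tilted more than about 5.7 degrees from horizontal, while `abs(cosine) < 0.9` alone would only trigger beyond roughly 26 degrees. Because both tests use absolute values, a line rotated by 180 degrees still passes the filter.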