refactor(pdf_parse): improve character spacing handling in PDF text extraction

- Update the logic for inserting spaces between characters
- Consider the next character's position instead of the previous one
- Adjust the spacing threshold to 25% of the average character width
- Ignore spaces at the end of lines to prevent double spaces

Commit 7c5cdcd4 authored Jan 02, 2025 by myhloli

Showing with 6 additions and 3 deletions

magic_pdf/pdf_parse_union_core_v2.py +6 -3

--- a/magic_pdf/pdf_parse_union_core_v2.py
+++ b/magic_pdf/pdf_parse_union_core_v2.py
@@ -92,9 +92,12 @@ def chars_to_content(span):
        content = ''
        for char in span['chars']:
            # If the gap between the next char's x0 and the previous char's x1 exceeds one character width, insert a space between them
-            if char['bbox'][0] - span['chars'][span['chars'].index(char) - 1]['bbox'][2] > char_avg_width:
-                content += ' '
-            content += char['c']
+            char1 = char
+            char2 = span['chars'][span['chars'].index(char) + 1] if span['chars'].index(char) + 1 < len(span['chars']) else None
+            if char2 and char2['bbox'][0] - char1['bbox'][2] > char_avg_width * 0.25  and char['c'] != ' ' and char2['c'] != ' ':
+                content += f"{char['c']} "
+            else:
+                content += char['c']
        content = __replace_ligatures(content)
        span['content'] = __replace_0xfffd(content)
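
For readers following the new rule, here is a minimal standalone sketch of the look-ahead spacing logic. It is not the code in the commit: the function name `chars_to_content_sketch` and the `gap_ratio` parameter are hypothetical, `char_avg_width` is assumed to be computed by the caller, and the ligature and 0xfffd replacement steps are left out. It walks the list by position with `enumerate` rather than calling `span['chars'].index(char)`, since `list.index` returns the first equal element and rescans the list on every call.

```python
def chars_to_content_sketch(chars, char_avg_width, gap_ratio=0.25):
    """Join chars into text, adding a space after a char when the gap to the next char is wide.

    Hypothetical helper for illustration; each char is assumed to be a dict
    with 'c' (the character) and 'bbox' ([x0, y0, x1, y1]).
    """
    parts = []
    for i, char in enumerate(chars):
        # Look ahead by position; the last char has no successor.
        next_char = chars[i + 1] if i + 1 < len(chars) else None
        parts.append(char['c'])
        if (
            next_char is not None
            # Horizontal gap: next char's left edge minus this char's right edge.
            and next_char['bbox'][0] - char['bbox'][2] > char_avg_width * gap_ratio
            # Skip insertion when either neighbour is already a space.
            and char['c'] != ' '
            and next_char['c'] != ' '
        ):
            parts.append(' ')
    return ''.join(parts)


# Example: average width 10 gives a threshold of 2.5; only the b -> c gap (5) exceeds it.
example_chars = [
    {'c': 'a', 'bbox': [0, 0, 10, 10]},
    {'c': 'b', 'bbox': [10.5, 0, 20, 10]},
    {'c': 'c', 'bbox': [25, 0, 35, 10]},
    {'c': 'd', 'bbox': [35, 0, 45, 10]},
]
example_text = chars_to_content_sketch(example_chars, char_avg_width=10)
```

Because the check runs only when a next character exists, no space is ever appended after the final character of a span.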