fix(pre_proc): improve character overlap handling in OCR processing

- Add condition to check for identical or space characters when resolving overlaps
- Skip non-conflicting character pairs to prevent unnecessary removals
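
The behavior described above can be sketched as a standalone function. This is a minimal illustration, not the code in `ocr_span_list_modify.py`: the function name `resolve_x_overlaps`, the `overlap_ratio` parameter, and the way the threshold is derived from `median_width` are assumptions made for the example; only the `c` and `bbox` keys and the removal rule come from the diff.

```python
def resolve_x_overlaps(span, median_width, overlap_ratio=0.3):
    """Drop one character of an overlapping pair, but only when the pair
    is a likely duplicate (same character) or involves a space.

    `overlap_ratio` is a hypothetical knob; the real threshold formula
    is not visible in the diff.
    """
    chars = span.get('chars', [])
    overlap_threshold = median_width * overlap_ratio  # assumed formula

    i = 0
    while i < len(chars) - 1:
        char1, char2 = chars[i], chars[i + 1]

        # Horizontal intersection of the two bounding boxes (x0, y0, x1, y1).
        x_left = max(char1['bbox'][0], char2['bbox'][0])
        x_right = min(char1['bbox'][2], char2['bbox'][2])
        overlap_width = x_right - x_left

        conflicting = (
            char1['c'] == char2['c'] or char1['c'] == ' ' or char2['c'] == ' '
        )

        if overlap_width > overlap_threshold and conflicting:
            width1 = char1['bbox'][2] - char1['bbox'][0]
            width2 = char2['bbox'][2] - char2['bbox'][0]
            # Remove the narrower character; on a tie, remove the second one.
            chars.pop(i if width1 < width2 else i + 1)
            # Do not advance i: the new pair at this index must be checked.
        else:
            # Distinct, non-space characters are kept even if they overlap.
            i += 1
    return span


span = {'chars': [
    {'c': 'A', 'bbox': [0, 0, 10, 10]},
    {'c': 'A', 'bbox': [2, 0, 12, 10]},   # duplicate of the previous 'A'
    {'c': 'B', 'bbox': [20, 0, 30, 10]},
    {'c': 'C', 'bbox': [22, 0, 32, 10]},  # overlaps 'B' but is a different glyph
]}
result = ''.join(ch['c'] for ch in resolve_x_overlaps(span, median_width=10)['chars'])
```

In this example the duplicated `A` is collapsed to one character, while the overlapping `B` and `C` both survive because they are distinct non-space characters.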

Commit be505a95 authored Mar 25, 2025 by myhloli

Showing with 10 additions and 8 deletions

magic_pdf/pre_proc/ocr_span_list_modify.py +10 -8

--- a/magic_pdf/pre_proc/ocr_span_list_modify.py
+++ b/magic_pdf/pre_proc/ocr_span_list_modify.py
@@ -71,15 +71,17 @@ def remove_x_overlapping_chars(span, median_width):
             overlap_width = x_right - x_left
             if overlap_width > overlap_threshold:
-                # Determine which character to remove
-                width1 = char1['bbox'][2] - char1['bbox'][0]
-                width2 = char2['bbox'][2] - char2['bbox'][0]
-                if width1 < width2:
-                    # Remove the narrower character
-                    span['chars'].pop(i)
+                if char1['c'] == char2['c'] or char1['c'] == ' ' or char2['c'] == ' ':
+                    # Determine which character to remove
+                    width1 = char1['bbox'][2] - char1['bbox'][0]
+                    width2 = char2['bbox'][2] - char2['bbox'][0]
+                    if width1 < width2:
+                        # Remove the narrower character
+                        span['chars'].pop(i)
+                    else:
+                        span['chars'].pop(i + 1)
                 else:
-                    span['chars'].pop(i + 1)
+                    i += 1
                 # Don't increment i since we need to check the new pair
             else: