Commit 0b35b73c authored by 赵小蒙's avatar 赵小蒙
Browse files

删除高度或者宽度为0的spans

parent 9bd6294b
...@@ -156,6 +156,9 @@ def parse_pdf_by_ocr( ...@@ -156,6 +156,9 @@ def parse_pdf_by_ocr(
int(x1 / horizontal_scale_ratio), int(x1 / horizontal_scale_ratio),
int(y1 / vertical_scale_ratio), int(y1 / vertical_scale_ratio),
] ]
# 删除高度或者宽度为0的spans
if bbox[2] - bbox[0] == 0 or bbox[3] - bbox[1] == 0:
continue
"""要删除的""" """要删除的"""
# 3: 'header', # 页眉 # 3: 'header', # 页眉
# 4: 'page number', # 页码 # 4: 'page number', # 页码
...@@ -193,6 +196,9 @@ def parse_pdf_by_ocr( ...@@ -193,6 +196,9 @@ def parse_pdf_by_ocr(
else: else:
continue continue
# 删除重叠spans中较小的那些 # 删除重叠spans中较小的那些
spans = remove_overlaps_min_spans(spans) spans = remove_overlaps_min_spans(spans)
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment