wangsen / MinerU
Commit 51b8c57d
authored Dec 19, 2024 by pangguosheng
fix: skip the char corresponding to invalid bounding boxes
parent b71993a9
Showing 1 changed file with 4 additions and 0 deletions
magic_pdf/pdf_parse_union_core_v2.py
@@ -108,6 +108,10 @@ def fill_char_in_spans(spans, all_chars):
     spans = sorted(spans, key=lambda x: x['bbox'][1])
     for char in all_chars:
+        # skip chars with an invalid bbox
+        x1, y1, x2, y2 = char['bbox']
+        if abs(x1 - x2) <= 0.01 or abs(y1 - y2) <= 0.01:
+            continue
         for span in spans:
             if calculate_char_in_span(char['bbox'], span['bbox'], char['c']):
                 span['chars'].append(char)
...
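The effect of the added guard can be tried in isolation with the minimal sketch below. It is not the project's code: `is_degenerate_bbox`, `char_center_in_span`, and `fill_char_in_spans_sketch` are hypothetical names introduced for illustration, and `char_center_in_span` is only a simplified stand-in for `calculate_char_in_span`, whose body is not part of this diff. The 0.01 threshold and the `(x1, y1, x2, y2)` bbox layout are taken from the commit.

```python
EPS = 0.01  # same threshold as the check added in this commit


def is_degenerate_bbox(bbox, eps=EPS):
    """Return True when the box has (almost) zero width or zero height."""
    x1, y1, x2, y2 = bbox
    return abs(x1 - x2) <= eps or abs(y1 - y2) <= eps


def char_center_in_span(char_bbox, span_bbox):
    # Hypothetical stand-in for calculate_char_in_span (assumption): a char
    # belongs to a span when the char's center lies inside the span's bbox.
    cx = (char_bbox[0] + char_bbox[2]) / 2
    cy = (char_bbox[1] + char_bbox[3]) / 2
    return (span_bbox[0] <= cx <= span_bbox[2]
            and span_bbox[1] <= cy <= span_bbox[3])


def fill_char_in_spans_sketch(spans, all_chars):
    # Sort spans from top to bottom, as in the original function.
    spans = sorted(spans, key=lambda s: s['bbox'][1])
    for char in all_chars:
        # Skip chars whose bbox is degenerate before any span matching.
        if is_degenerate_bbox(char['bbox']):
            continue
        for span in spans:
            if char_center_in_span(char['bbox'], span['bbox']):
                span['chars'].append(char)
    return spans


spans = [{'bbox': (0, 0, 100, 20), 'chars': []}]
chars = [
    {'bbox': (10, 5, 18, 15), 'c': 'a'},       # normal box
    {'bbox': (20, 5, 20, 15), 'c': 'b'},       # zero width
    {'bbox': (30, 10, 38, 10.005), 'c': 'c'},  # near-zero height
]
result = fill_char_in_spans_sketch(spans, chars)
kept = [ch['c'] for ch in result[0]['chars']]
print(kept)  # ['a']
```

Only the character with a non-degenerate box reaches the span-matching step. Pulling the check into a small predicate is a design choice of this sketch; the commit itself inlines the comparison in the loop.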