wangsen / MinerU
Commit 7964ae45
authored Nov 25, 2024 by myhloli
refactor(pdf_parse): improve code readability and maintainability
parent 97bcc8b2
Showing 1 changed file with 22 additions and 22 deletions

magic_pdf/pdf_parse_union_core_v2.py
@@ -89,29 +89,29 @@ def __replace_STX_ETX(text_str: str):
The removed and added lines match token for token in this view, so the change appears to be limited to whitespace.

def chars_to_content(span):
    # # First sort chars by the x coordinate of char['bbox']
    # span['chars'] = sorted(span['chars'], key=lambda x: x['bbox'][0])

    # First sort chars by the x coordinate of the center of char['bbox']
    span['chars'] = sorted(span['chars'], key=lambda x: (x['bbox'][0] + x['bbox'][2]) / 2)
    content = ''

    # Compute the average char width
    if len(span['chars']) == 0:
        span['content'] = content
        del span['chars']
        return
    else:
        char_width_sum = sum([char['bbox'][2] - char['bbox'][0] for char in span['chars']])
        char_avg_width = char_width_sum / len(span['chars'])

    for char in span['chars']:
        # If the gap between the next char's x0 and the previous char's x1 exceeds one char width, insert a space between them
        if char['bbox'][0] - span['chars'][span['chars'].index(char) - 1]['bbox'][2] > char_avg_width:
            content += ' '
        content += char['c']

    span['content'] = __replace_STX_ETX(content)
    del span['chars']
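
A note on the gap check in `chars_to_content`: `span['chars'].index(char) - 1` evaluates to `-1` for the first character, so the first gap is measured against the last character of the span. That normally gives a negative gap and no space, but it depends on the sort order, and `index` rescans the list on every iteration and returns the first match when two char dicts compare equal. A minimal sketch of the same gap rule that walks adjacent pairs instead is shown below. The function name `join_chars_with_gaps` is hypothetical and not part of MinerU; it assumes each char is a dict with `bbox` as `[x0, y0, x1, y1]` and `c` as in the diff, and it leaves out the `__replace_STX_ETX` step.

```python
def join_chars_with_gaps(chars):
    """Join chars left to right, adding a space where a gap exceeds the average char width.

    Hypothetical helper for illustration; not part of MinerU.
    """
    if not chars:
        return ''
    # Sort by the x coordinate of each bbox center, as in the diff.
    ordered = sorted(chars, key=lambda ch: (ch['bbox'][0] + ch['bbox'][2]) / 2)
    avg_width = sum(ch['bbox'][2] - ch['bbox'][0] for ch in ordered) / len(ordered)

    parts = [ordered[0]['c']]
    # Compare each char with its actual predecessor instead of looking it up by index().
    for prev, cur in zip(ordered, ordered[1:]):
        if cur['bbox'][0] - prev['bbox'][2] > avg_width:
            parts.append(' ')
        parts.append(cur['c'])
    return ''.join(parts)


chars = [
    {'c': 'b', 'bbox': [10, 0, 20, 10]},
    {'c': 'a', 'bbox': [0, 0, 10, 10]},
    {'c': 'c', 'bbox': [45, 0, 55, 10]},
]
print(join_chars_with_gaps(chars))  # prints "ab c"
```

Pairing each character with its predecessor through `zip` keeps the loop linear in the number of characters and never touches the first character's nonexistent left neighbor.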
LINE_STOP_FLAG = ('.', '!', '?', '。', '!', '?', ')', ')', '"', '”', ':', ':', ';', ';', ']', '】', '}', '}', '>', '》', '、', ',', ',', '-', '—', '–',)
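
The diff does not show where `LINE_STOP_FLAG` is consumed. One plausible use, sketched here purely as an assumption, is to test whether a line of text ends with a sentence-closing or clause-closing mark. `str.endswith` accepts a tuple of suffixes, so a tuple constant of this shape can be passed to it directly. The helper `ends_with_stop_flag` and the shortened tuple `STOP_FLAGS` are hypothetical and exist only for illustration.

```python
# Hypothetical, shortened stand-in for LINE_STOP_FLAG; not the full tuple from the diff.
STOP_FLAGS = ('.', '!', '?', '。', ':', ';', ',', '-')


def ends_with_stop_flag(text, flags=STOP_FLAGS):
    """Return True if text, ignoring trailing whitespace, ends with one of the flags."""
    return text.rstrip().endswith(flags)


print(ends_with_stop_flag('This is a full sentence. '))   # True
print(ends_with_stop_flag('This line continues on the'))  # False
print(ends_with_stop_flag(''))                            # False
```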