wangsen / MinerU
Commit 778b1fb7
Authored Apr 23, 2024 by liukaiwen
Updated para_split
Parent: bb2bf065
Showing 1 changed file with 13 additions and 9 deletions
magic_pdf/para/para_split_v2.py (+13, -9)
@@ -87,17 +87,21 @@ def __detect_list_lines(lines, new_layout_bboxes, lang):
     """
     for l in lines:
         first_char = __get_span_text(l['spans'][0])[0]
-        layout_left = __find_layout_bbox_by_line(l['bbox'], new_layout_bboxes)[0]
+        layout = __find_layout_bbox_by_line(l['bbox'], new_layout_bboxes)
-        if l['bbox'][0] == layout_left:
+        if not layout:
-            if first_char.isupper() or first_char.isdigit():
+            line_fea_encode.append(0)
-                line_fea_encode.append(1)
-            else:
-                line_fea_encode.append(4)
         else:
-            if first_char.isupper():
+            layout_left = layout[0]
-                line_fea_encode.append(2)
+            if l['bbox'][0] == layout_left:
+                if first_char.isupper() or first_char.isdigit():
+                    line_fea_encode.append(1)
+                else:
+                    line_fea_encode.append(4)
             else:
-                line_fea_encode.append(3)
+                if first_char.isupper():
+                    line_fea_encode.append(2)
+                else:
+                    line_fea_encode.append(3)
     # Next, segment by the encoding: lines where codes 1, 2, 3 occur consecutively at least 2 times are treated as a list.
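
The change in this commit stops subscripting the result of __find_layout_bbox_by_line directly and instead emits code 0 when no layout box is found. The sketch below restates the post-commit encoding as a standalone function so the five codes can be tried in isolation. It is an illustration under assumptions, not the project's implementation: find_layout_bbox is a hypothetical stand-in for __find_layout_bbox_by_line (assumed here to return None when no box contains the line), and reading the text from span["content"] is an assumed stand-in for __get_span_text.

```python
def find_layout_bbox(line_bbox, layout_bboxes):
    """Hypothetical stand-in: first layout box that contains the line, else None."""
    x0, y0, x1, y1 = line_bbox
    for box in layout_bboxes:
        if box[0] <= x0 and box[1] <= y0 and box[2] >= x1 and box[3] >= y1:
            return box
    return None


def encode_lines(lines, layout_bboxes):
    """Return one feature code per line, following the branches in the diff.

    0: no layout box found for the line
    1: flush with the layout's left edge, starts with an uppercase letter or digit
    4: flush with the layout's left edge, any other first character
    2: not flush left, starts with an uppercase letter
    3: not flush left, any other first character
    """
    codes = []
    for line in lines:
        # Assumption: each span is a dict whose text lives under "content".
        first_char = line["spans"][0]["content"][0]
        layout = find_layout_bbox(line["bbox"], layout_bboxes)
        if not layout:
            # Guard added by the commit: without it, indexing a missing
            # layout (None[0]) would raise a TypeError.
            codes.append(0)
            continue
        if line["bbox"][0] == layout[0]:
            codes.append(1 if first_char.isupper() or first_char.isdigit() else 4)
        else:
            codes.append(2 if first_char.isupper() else 3)
    return codes
```

Note that the flush-left test is an exact equality on the x coordinate, as in the diff, so a line offset from the layout edge by even a fraction of a unit falls into the 2/3 branch.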