wangsen / MinerU
Commit 5de013e6
Authored Jun 19, 2024 by 赵小蒙
fix:use line_lang instead of content_lang to concatenate para
Parent: 5f313bd0
Showing 1 changed file with 9 additions and 2 deletions
magic_pdf/dict2md/ocr_mkcontent.py @ 5de013e6 (+9, -2)
@@ -144,10 +144,17 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
 def merge_para_with_text(para_block):
     para_text = ''
     for line in para_block['lines']:
+        line_text = ""
+        line_lang = ""
+        for span in line['spans']:
+            span_type = span['type']
+            if span_type == ContentType.Text:
+                line_text += span['content'].strip()
+        if line_text != "":
+            line_lang = detect_lang(line_text)
         for span in line['spans']:
             span_type = span['type']
             content = ''
-            language = ''
             if span_type == ContentType.Text:
                 content = span['content']
                 language = detect_lang(content)
@@ -161,7 +168,7 @@ def merge_para_with_text(para_block):
                 content = f"\n$$\n{span['content']}\n$$\n"
             if content != '':
-                if 'zh' in language:
+                if 'zh' in line_lang:  # Some documents put one character per span; single-character language detection is unreliable, so use the whole line's text
                     para_text += content  # In Chinese text, no space is needed between contents
                 else:
                     para_text += content + ' '  # In English text, contents are separated by a space
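The effect of the change can be illustrated with a minimal, self-contained sketch. It is not the repository's code: `detect_lang_stub` is a hypothetical stand-in for `detect_lang` (whose implementation is not shown in this commit), `TEXT` is an assumed placeholder for `ContentType.Text`, and only text spans are handled. It shows a line that is split into one character per span being joined by the language of the whole line rather than of each span.

```python
import re

TEXT = "text"  # assumed placeholder for ContentType.Text

_CJK = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs block


def detect_lang_stub(text: str) -> str:
    """Hypothetical stand-in for detect_lang.

    Returns 'zh' when at least 30% of the non-space characters are CJK
    ideographs, 'en' otherwise, and '' for empty input.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return ""
    cjk = sum(1 for c in chars if _CJK.match(c))
    return "zh" if cjk / len(chars) >= 0.3 else "en"


def merge_line(line: dict) -> str:
    """Join the text spans of one line using a line-level language."""
    # Pass 1: collect the whole line's text and detect its language once.
    line_text = "".join(
        span["content"].strip() for span in line["spans"] if span["type"] == TEXT
    )
    line_lang = detect_lang_stub(line_text) if line_text else ""

    # Pass 2: concatenate spans; the separator depends on line_lang only.
    merged = ""
    for span in line["spans"]:
        if span["type"] != TEXT or span["content"] == "":
            continue
        if "zh" in line_lang:
            merged += span["content"]        # no separator in Chinese text
        else:
            merged += span["content"] + " "  # space-separated otherwise
    return merged


# One character per span, with a Latin letter in the middle of a Chinese line.
zh_line = {"spans": [{"type": TEXT, "content": c} for c in "模型A训练"]}
en_line = {"spans": [{"type": TEXT, "content": w} for w in ["hello", "world"]]}

print(repr(merge_line(zh_line)))
print(repr(merge_line(en_line)))
```

With per-span detection, the single-character span "A" would be classified on its own and could pick up a trailing space in the middle of a Chinese line. Deciding once per line avoids that, at the cost of treating mixed-language lines as a single language.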