wangsen / MinerU
Commit 71a9a955
Authored Nov 16, 2024 by Xiaomeng Zhao; committed by GitHub on Nov 16, 2024

Merge pull request #976 from opendatalab/dev
Dev to 0.9.3

Parents: 6083e109, 8061dfce
Changes: 12 changed files with 306 additions and 36 deletions (+306 -36)
README.md  +5 -11
README_zh-CN.md  +7 -11
projects/web_demo/web_demo/api/analysis/pdf_ext.py  +17 -13
projects/web_demo/web_demo/app.py  +1 -1
projects/web_demo/web_demo/common/mk_markdown/__init__.py  +0 -0
projects/web_demo/web_demo/common/mk_markdown/libs/__init__.py  +0 -0
projects/web_demo/web_demo/common/mk_markdown/libs/language.py  +36 -0
projects/web_demo/web_demo/common/mk_markdown/libs/markdown_utils.py  +31 -0
projects/web_demo/web_demo/common/mk_markdown/libs/ocr_content_type.py  +38 -0
projects/web_demo/web_demo/common/mk_markdown/mk_markdown.py  +169 -0
projects/web_demo/web_demo/common/mk_markdown/resources/fasttext-langdetect/lid.176.ftz  +0 -0
projects/web_demo/web_demo/config/config.yaml  +2 -0
README.md
@@ -121,7 +121,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
 - Preserve the structure of the original document, including headings, paragraphs, lists, etc.
 - Extract images, image descriptions, tables, table titles, and footnotes.
 - Automatically recognize and convert formulas in the document to LaTeX format.
-- Automatically recognize and convert tables in the document to LaTeX or HTML format.
+- Automatically recognize and convert tables in the document to HTML format.
 - Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality.
 - OCR supports detection and recognition of 84 languages.
 - Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats.
@@ -185,17 +185,11 @@ There are three different ways to experience MinerU:
     </tr>
     <tr>
         <td rowspan="2">GPU Hardware Support List</td>
-        <td colspan="2">Minimum Requirement 8G+ VRAM</td>
-        <td colspan="2">3060ti/3070/4060<br>
-        8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
+        <td colspan="2">GPU VRAM 8GB or more</td>
+        <td colspan="2">2080~2080Ti / 3060Ti~3090Ti / 4060~4090<br>
+        8G VRAM can enable all acceleration features</td>
         <td rowspan="2">None</td>
     </tr>
-    <tr>
-        <td colspan="2">Recommended Configuration 10G+ VRAM</td>
-        <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
-        </td>
-    </tr>
 </table>

 ### Online Demo
@@ -247,7 +241,7 @@ You can modify certain configurations in this file to enable or disable features
         "enable": true  // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
     },
     "table-config": {
-        "model": "rapid_table",  // When using structEqTable, please change to "struct_eqtable".
+        "model": "rapid_table",  // Default to using "rapid_table", can be switched to "tablemaster" or "struct_eqtable".
         "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
         "max_time": 400
     }
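As a rough illustration of the `table-config` block in the hunk above, the sketch below switches table recognition on in an already-parsed configuration dictionary. The helper name `with_table_recognition` is hypothetical and not part of MinerU; the three model names are the ones listed in the updated README comment. Loading and saving the file is left out because the README snippet carries `//` comments, which a strict JSON parser would reject.

```python
import copy

# Model names taken from the updated README comment; nothing else is assumed valid.
KNOWN_TABLE_MODELS = {"rapid_table", "tablemaster", "struct_eqtable"}


def with_table_recognition(config: dict, model: str = "rapid_table") -> dict:
    """Return a copy of `config` with table recognition enabled (hypothetical helper)."""
    if model not in KNOWN_TABLE_MODELS:
        raise ValueError(f"unknown table model: {model!r}")
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    table_config = updated.setdefault("table-config", {})
    table_config["model"] = model
    table_config["enable"] = True
    return updated


# Defaults as shown in the README hunk.
defaults = {"table-config": {"model": "rapid_table", "enable": False, "max_time": 400}}
enabled = with_table_recognition(defaults, model="struct_eqtable")
print(enabled["table-config"])
```

Returning a copy keeps the default configuration reusable, which is convenient when comparing runs with different table models.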
README_zh-CN.md
@@ -121,7 +121,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
 - Preserve the structure of the original document, including headings, paragraphs, lists, etc.
 - Extract images, image descriptions, tables, table titles, and footnotes
 - Automatically recognize formulas in the document and convert them to LaTeX format
-- Automatically recognize tables in the document and convert them to LaTeX or HTML format
+- Automatically recognize tables in the document and convert them to HTML format
 - Automatically detect scanned PDFs and garbled PDFs and enable OCR
 - OCR supports detection and recognition of 84 languages
 - Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and an information-rich intermediate format
@@ -186,17 +186,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
     </tr>
     <tr>
         <td rowspan="2">GPU Hardware Support List</td>
-        <td colspan="2">Minimum requirement: 8G+ VRAM</td>
-        <td colspan="2">3060ti/3070/4060<br>
-        8G VRAM can enable all acceleration features (tables limited to rapid_table)</td>
+        <td colspan="2">8G VRAM or more</td>
+        <td colspan="2">
+        2080~2080Ti / 3060Ti~3090Ti / 4060~4090<br>
+        8G VRAM or more can enable all acceleration features</td>
         <td rowspan="2">None</td>
     </tr>
-    <tr>
-        <td colspan="2">Recommended configuration: 10G+ VRAM</td>
-        <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10G VRAM or more can enable all acceleration features<br>
-        </td>
-    </tr>
 </table>

 ### Online Demo
@@ -251,7 +247,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
         "enable": true  // Formula recognition is enabled by default; to disable it, change this value to "false"
     },
     "table-config": {
-        "model": "rapid_table",  // To use structEqTable, change this to "struct_eqtable"
+        "model": "rapid_table",  // Uses "rapid_table" by default; can be switched to "tablemaster" or "struct_eqtable"
         "enable": false, // Table recognition is disabled by default; to enable it, change this value to "true"
         "max_time": 400
     }
projects/web_demo/web_demo/api/analysis/pdf_ext.py
 import json
 import re
+import os
+import shutil
 import traceback
 from pathlib import Path
 from flask import current_app, url_for
@@ -7,7 +9,7 @@ from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
 from magic_pdf.pipe.UNIPipe import UNIPipe
 import magic_pdf.model as model_config
 from magic_pdf.libs.json_compressor import JsonCompressor
-from magic_pdf.dict2md.ocr_mkcontent import ocr_mk_mm_markdown_with_para_and_pagination
+from common.mk_markdown.mk_markdown import ocr_mk_mm_markdown_with_para_and_pagination
 from .ext import find_file
 from ..extentions import app, db
 from .models import AnalysisPdf, AnalysisTask
@@ -17,7 +19,7 @@ from loguru import logger
 model_config.__use_inside_model__ = True
-def analysis_pdf(image_dir, pdf_bytes, is_ocr=False):
+def analysis_pdf(image_url_prefix, image_dir, pdf_bytes, is_ocr=False):
     try:
         model_json = []  # passing an empty list as model_json makes the built-in model do the parsing
         logger.info(f"is_ocr: {is_ocr}")
@@ -40,7 +42,7 @@ def analysis_pdf(image_dir, pdf_bytes, is_ocr=False):
         pipe.pipe_parse()
         pdf_mid_data = JsonCompressor.decompress_json(pipe.get_compress_pdf_mid_data())
         pdf_info_list = pdf_mid_data["pdf_info"]
-        md_content = json.dumps(ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_list, image_dir),
+        md_content = json.dumps(ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_list, image_url_prefix),
                                 ensure_ascii=False)
         bbox_info = get_bbox_info(pdf_info_list)
         return md_content, bbox_info
@@ -77,20 +79,22 @@ def analysis_pdf_task(pdf_dir, image_dir, pdf_path, is_ocr, analysis_pdf_id):
         logger.info(f"image_dir: {image_dir}")
         if not Path(image_dir).exists():
             Path(image_dir).mkdir(parents=True, exist_ok=True)
+        else:
+            # Empty image_dir so images do not pile up when the same file is parsed repeatedly
+            shutil.rmtree(image_dir, ignore_errors=True)
+            os.makedirs(image_dir, exist_ok=True)
+        # Read the file contents
         with open(pdf_path, 'rb') as file:
             pdf_bytes = file.read()
-        md_content, bbox_info = analysis_pdf(image_dir, pdf_bytes, is_ocr)
-        img_list = Path(image_dir).glob('*') if Path(image_dir).exists() else []
+        # Build the image link
+        with app.app_context():
+            image_url_prefix = f"http://{current_app.config['SERVER_NAME']}{current_app.config['FILE_API']}&pdf={Path(pdf_path).name}&filename="
+        # Parse the file
+        md_content, bbox_info = analysis_pdf(image_url_prefix, image_dir, pdf_bytes, is_ocr)
+        # ############ markdown #############
         pdf_name = Path(pdf_path).name
-        with app.app_context():
-            for img in img_list:
-                img_name = Path(img).name
-                regex = re.compile(fr'.*\((.*?{img_name})')
-                regex_result = regex.search(md_content)
-                if regex_result:
-                    img_url = url_for('analysis.imgview', filename=img_name, as_attachment=False)
-                    md_content = md_content.replace(regex_result.group(1), f"{img_url}&pdf={pdf_name}")
         full_md_content = ""
         for item in json.loads(md_content):
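The change above replaces the per-image regex rewrite with a URL prefix computed once and passed into the Markdown generator. A minimal sketch of that idea follows, with no Flask dependency; `build_image_url_prefix`, the host, and the file names are illustrative assumptions, while the `FILE_API` value is the one added to `config.yaml` in this commit.

```python
def build_image_url_prefix(server_name: str, file_api: str, pdf_name: str) -> str:
    """Compose the prefix the same way the new f-string in analysis_pdf_task does."""
    return f"http://{server_name}{file_api}&pdf={pdf_name}&filename="


# Hypothetical host and file names, used only for illustration.
prefix = build_image_url_prefix(
    server_name="127.0.0.1:5559",
    file_api="/api/v2/analysis/pdf_img?as_attachment=False",
    pdf_name="demo.pdf",
)

# The Markdown generator only has to append the image file name to the prefix.
image_link = f"![]({prefix}page0_img0.jpg)"
print(image_link)
```

As in the diff, the PDF name is interpolated without escaping, so a name containing spaces or `&` would probably need `urllib.parse.quote` before being placed in the query string.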
projects/web_demo/web_demo/app.py
@@ -42,7 +42,7 @@ if database:
 ip_address = get_local_ip()
 port = config.get("PORT", 5559)
 # Configure SERVER_NAME
-config['SERVER_NAME'] = f'{ip_address}:5559'
+config['SERVER_NAME'] = f'{ip_address}:{port}'
 # Configure APPLICATION_ROOT
 config['APPLICATION_ROOT'] = '/'
 # Configure PREFERRED_URL_SCHEME
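The one-line fix above makes `SERVER_NAME` follow the configured port instead of a hard-coded `5559`. The sketch below isolates that behavior; `server_name_for` is a hypothetical helper, and a plain dictionary stands in for the application's config object.

```python
def server_name_for(ip_address: str, config: dict) -> str:
    """Build host:port, falling back to 5559 when PORT is not configured."""
    port = config.get("PORT", 5559)
    return f"{ip_address}:{port}"


print(server_name_for("10.0.0.2", {"PORT": 8080}))
print(server_name_for("10.0.0.2", {}))
```

Before the change, the first call would still have produced port `5559`, so generated URLs could point at the wrong port whenever `PORT` was overridden.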
projects/web_demo/web_demo/common/mk_markdown/__init__.py (new file, 0 → 100644, empty)
projects/web_demo/web_demo/common/mk_markdown/libs/__init__.py (new file, 0 → 100644, empty)
projects/web_demo/web_demo/common/mk_markdown/libs/language.py (new file, 0 → 100644)
+import os
+import unicodedata
+
+if not os.getenv("FTLANG_CACHE"):
+    current_file_path = os.path.abspath(__file__)
+    current_dir = os.path.dirname(current_file_path)
+    root_dir = os.path.dirname(current_dir)
+    ftlang_cache_dir = os.path.join(root_dir, 'resources', 'fasttext-langdetect')
+    os.environ["FTLANG_CACHE"] = str(ftlang_cache_dir)
+    # print(os.getenv("FTLANG_CACHE"))
+
+from fast_langdetect import detect_language
+
+
+def detect_lang(text: str) -> str:
+    if len(text) == 0:
+        return ""
+    try:
+        lang_upper = detect_language(text)
+    except:
+        html_no_ctrl_chars = ''.join([l for l in text if unicodedata.category(l)[0] not in ['C', ]])
+        lang_upper = detect_language(html_no_ctrl_chars)
+    try:
+        lang = lang_upper.lower()
+    except:
+        lang = ""
+    return lang
+
+
+if __name__ == '__main__':
+    print(os.getenv("FTLANG_CACHE"))
+    print(detect_lang("This is a test."))
+    print(detect_lang("<html>This is a test</html>"))
+    print(detect_lang("这个是中文测试。"))
+    print(detect_lang("<html>这个是中文测试。</html>"))
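The fallback branch of `detect_lang` retries detection after dropping every character whose Unicode general category starts with `C` (control, format, surrogate, private use, unassigned). That filtering step can be tried on its own without `fast_langdetect`; `strip_control_chars` is a hypothetical name for the inline list comprehension used in the file.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Drop characters in any Unicode 'C*' category, mirroring the fallback in detect_lang."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


# NUL (Cc), zero-width space (Cf) and the trailing newline (Cc) are all removed.
cleaned = strip_control_chars("This is\x00 a\u200b test.\n")
print(repr(cleaned))
```

Note that newlines and tabs are also category `Cc`, so this filter removes them as well as invisible control bytes.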
projects/web_demo/web_demo/common/mk_markdown/libs/markdown_utils.py (new file, 0 → 100644)
+import re
+
+
+def escape_special_markdown_char(pymu_blocks):
+    """
+    Escape characters in body text that have special meaning in Markdown syntax
+    """
+    special_chars = ["*", "`", "~", "$"]
+    for blk in pymu_blocks:
+        for line in blk['lines']:
+            for span in line['spans']:
+                for char in special_chars:
+                    span_text = span['text']
+                    span_type = span.get("_type", None)
+                    if span_type in ['inline-equation', 'interline-equation']:
+                        continue
+                    elif span_text:
+                        span['text'] = span['text'].replace(char, "\\" + char)
+
+    return pymu_blocks
+
+
+def ocr_escape_special_markdown_char(content):
+    """
+    Escape characters in body text that have special meaning in Markdown syntax
+    """
+    special_chars = ["*", "`", "~", "$"]
+    for char in special_chars:
+        content = content.replace(char, "\\" + char)
+
+    return content
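To see what `ocr_escape_special_markdown_char` does to a span of text, the standalone sketch below repeats its loop under a different, hypothetical name (`escape_markdown`) so that it runs without the package; the sample string is made up.

```python
SPECIAL_CHARS = ["*", "`", "~", "$"]


def escape_markdown(content: str) -> str:
    """Prefix each Markdown-significant character with a backslash."""
    for char in SPECIAL_CHARS:
        content = content.replace(char, "\\" + char)
    return content


escaped = escape_markdown("cost: $5 * 2 ~ `ten`")
print(escaped)
```

Because the backslash itself is not in the list, the function is not idempotent: running it twice on the same text adds a second layer of backslashes, so it should be applied exactly once per span.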
projects/web_demo/web_demo/common/mk_markdown/libs/ocr_content_type.py (new file, 0 → 100644)
+class ContentType:
+    Image = 'image'
+    Table = 'table'
+    Text = 'text'
+    InlineEquation = 'inline_equation'
+    InterlineEquation = 'interline_equation'
+
+
+class BlockType:
+    Image = 'image'
+    ImageBody = 'image_body'
+    ImageCaption = 'image_caption'
+    ImageFootnote = 'image_footnote'
+    Table = 'table'
+    TableBody = 'table_body'
+    TableCaption = 'table_caption'
+    TableFootnote = 'table_footnote'
+    Text = 'text'
+    Title = 'title'
+    InterlineEquation = 'interline_equation'
+    Footnote = 'footnote'
+    Discarded = 'discarded'
+
+
+class CategoryId:
+    Title = 0
+    Text = 1
+    Abandon = 2
+    ImageBody = 3
+    ImageCaption = 4
+    TableBody = 5
+    TableCaption = 6
+    TableFootnote = 7
+    InterlineEquation_Layout = 8
+    InlineEquation = 13
+    InterlineEquation_YOLO = 14
+    OcrText = 15
+    ImageFootnote = 101
projects/web_demo/web_demo/common/mk_markdown/mk_markdown.py (new file, 0 → 100644)
+import re
+
+import wordninja
+
+from .libs.language import detect_lang
+from .libs.markdown_utils import ocr_escape_special_markdown_char
+from .libs.ocr_content_type import BlockType, ContentType
+
+
+def __is_hyphen_at_line_end(line):
+    """
+    Check if a line ends with one or more letters followed by a hyphen.
+
+    Args:
+    line (str): The line of text to check.
+
+    Returns:
+    bool: True if the line ends with one or more letters followed by a hyphen, False otherwise.
+    """
+    # Use regex to check if the line ends with one or more letters followed by a hyphen
+    return bool(re.search(r'[A-Za-z]+-\s*$', line))
+
+
+def split_long_words(text):
+    segments = text.split(' ')
+    for i in range(len(segments)):
+        words = re.findall(r'\w+|[^\w]', segments[i], re.UNICODE)
+        for j in range(len(words)):
+            if len(words[j]) > 10:
+                words[j] = ' '.join(wordninja.split(words[j]))
+        segments[i] = ''.join(words)
+    return ' '.join(segments)
+
+
+def join_path(*args):
+    return ''.join(str(s).rstrip('/') for s in args)
+
+
+def ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_dict: list,
+                                                img_buket_path):
+    markdown_with_para_and_pagination = []
+    page_no = 0
+    for page_info in pdf_info_dict:
+        paras_of_layout = page_info.get('para_blocks')
+        if not paras_of_layout:
+            continue
+        page_markdown = ocr_mk_markdown_with_para_core_v2(
+            paras_of_layout, 'mm', img_buket_path)
+        markdown_with_para_and_pagination.append({
+            'page_no': page_no,
+            'md_content': '\n\n'.join(page_markdown)
+        })
+        page_no += 1
+    return markdown_with_para_and_pagination
+
+
+def merge_para_with_text(para_block):
+    def detect_language(text):
+        en_pattern = r'[a-zA-Z]+'
+        en_matches = re.findall(en_pattern, text)
+        en_length = sum(len(match) for match in en_matches)
+        if len(text) > 0:
+            if en_length / len(text) >= 0.5:
+                return 'en'
+            else:
+                return 'unknown'
+        else:
+            return 'empty'
+
+    para_text = ''
+    for line in para_block['lines']:
+        line_text = ''
+        line_lang = ''
+        for span in line['spans']:
+            span_type = span['type']
+            if span_type == ContentType.Text:
+                line_text += span['content'].strip()
+        if line_text != '':
+            line_lang = detect_lang(line_text)
+        for span in line['spans']:
+            span_type = span['type']
+            content = ''
+            if span_type == ContentType.Text:
+                content = span['content']
+                # language = detect_lang(content)
+                language = detect_language(content)
+                if language == 'en':  # Only split long words in English; word-splitting Chinese would lose text
+                    content = ocr_escape_special_markdown_char(split_long_words(content))
+                else:
+                    content = ocr_escape_special_markdown_char(content)
+            elif span_type == ContentType.InlineEquation:
+                content = f" ${span['content']}$ "
+            elif span_type == ContentType.InterlineEquation:
+                content = f"\n$$\n{span['content']}\n$$\n"
+
+            if content != '':
+                langs = ['zh', 'ja', 'ko']
+                if line_lang in langs:  # Some documents put one character per span; detection on a single character is unreliable, so the whole line's text is used
+                    para_text += content  # In Chinese/Japanese/Korean text, no space is needed between contents
+                elif line_lang == 'en':
+                    # If the preceding line ends with a hyphen, no trailing space should be added
+                    if __is_hyphen_at_line_end(content):
+                        para_text += content[:-1]
+                    else:
+                        para_text += content + ' '
+                else:
+                    para_text += content + ' '  # In Western text, contents need to be separated by spaces
+    return para_text
+
+
+def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
+                                      mode,
+                                      img_buket_path=''):
+    page_markdown = []
+    for para_block in paras_of_layout:
+        para_text = ''
+        para_type = para_block['type']
+        if para_type == BlockType.Text:
+            para_text = merge_para_with_text(para_block)
+        elif para_type == BlockType.Title:
+            para_text = f'# {merge_para_with_text(para_block)}'
+        elif para_type == BlockType.InterlineEquation:
+            para_text = merge_para_with_text(para_block)
+        elif para_type == BlockType.Image:
+            if mode == 'nlp':
+                continue
+            elif mode == 'mm':
+                for block in para_block['blocks']:  # 1st. append image_body
+                    if block['type'] == BlockType.ImageBody:
+                        for line in block['lines']:
+                            for span in line['spans']:
+                                if span['type'] == ContentType.Image:
+                                    para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
+                for block in para_block['blocks']:  # 2nd. append image_caption
+                    if block['type'] == BlockType.ImageCaption:
+                        para_text += merge_para_with_text(block)
+                for block in para_block['blocks']:  # 2nd. append image_caption
+                    if block['type'] == BlockType.ImageFootnote:
+                        para_text += merge_para_with_text(block)
+        elif para_type == BlockType.Table:
+            if mode == 'nlp':
+                continue
+            elif mode == 'mm':
+                for block in para_block['blocks']:  # 1st. append table_caption
+                    if block['type'] == BlockType.TableCaption:
+                        para_text += merge_para_with_text(block)
+                for block in para_block['blocks']:  # 2nd. append table_body
+                    if block['type'] == BlockType.TableBody:
+                        for line in block['lines']:
+                            for span in line['spans']:
+                                if span['type'] == ContentType.Table:
+                                    # if processed by table model
+                                    if span.get('latex', ''):
+                                        para_text += f"\n\n$\n {span['latex']}\n$\n\n"
+                                    elif span.get('html', ''):
+                                        para_text += f"\n\n{span['html']}\n\n"
+                                    else:
+                                        para_text += f"\n![]({join_path(img_buket_path, span['image_path'])})  \n"
+                for block in para_block['blocks']:  # 3rd. append table_footnote
+                    if block['type'] == BlockType.TableFootnote:
+                        para_text += merge_para_with_text(block)
+
+        if para_text.strip() == '':
+            continue
+        else:
+            page_markdown.append(para_text.strip() + ' ')
+
+    return page_markdown
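Three small helpers in `mk_markdown.py` drive most of its text handling: `join_path`, the inner English-ratio check in `merge_para_with_text`, and the end-of-line hyphen test. The sketch below copies their logic under hypothetical names (`looks_english`, `ends_with_hyphenated_word`) so the behavior can be checked in isolation; the sample inputs are made up.

```python
import re


def join_path(*args) -> str:
    """Concatenate parts after stripping trailing slashes; no separator is inserted."""
    return "".join(str(s).rstrip("/") for s in args)


def looks_english(text: str) -> str:
    """Return 'en' when ASCII letters make up at least half of the characters."""
    if not text:
        return "empty"
    en_length = sum(len(m) for m in re.findall(r"[a-zA-Z]+", text))
    return "en" if en_length / len(text) >= 0.5 else "unknown"


def ends_with_hyphenated_word(line: str) -> bool:
    """True when the line ends with letters followed by a hyphen (optionally spaces)."""
    return bool(re.search(r"[A-Za-z]+-\s*$", line))


# join_path suits a URL prefix ending in '&filename=' because nothing is inserted.
print(join_path("http://host/api?x=1&filename=", "img.jpg"))
# With a directory-style prefix the slash is stripped and not replaced.
print(join_path("images/", "a.jpg"))
print(looks_english("Hello world"), looks_english("这个是中文测试。"), looks_english(""))
print(ends_with_hyphenated_word("multi-"), ends_with_hyphenated_word("done."))
```

The second `join_path` call shows why this variant appears tailored to the web demo's query-string prefix rather than to filesystem paths.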
projects/web_demo/web_demo/common/mk_markdown/resources/fasttext-langdetect/lid.176.ftz (new file, 0 → 100644)
File added
projects/web_demo/web_demo/config/config.yaml
@@ -13,6 +13,8 @@ BaseConfig: &base
   PDF_ANALYSIS_FOLDER: "analysis_pdf"
   # Path of the packaged front-end project
   REACT_APP_DIST: "../../web/dist/"
+  # File access path
+  FILE_API: "/api/v2/analysis/pdf_img?as_attachment=False"
 # Development configuration
 DevelopmentConfig: