wangsen / MinerU / Commits / ece7f8d5
Commit ece7f8d5 (Unverified), authored Oct 15, 2024 by Kaiwen Liu; committed by GitHub, Oct 15, 2024
Merge pull request #6 from opendatalab/dev
Dev
Parents: 98362a6e, 702b6ac9
Changes
551
Showing 20 changed files with 198 additions and 18 deletions
old_docs/images/layout_example.png (+0 -0)
old_docs/images/poly.png (+0 -0)
old_docs/images/project_panorama_en.png (+0 -0)
old_docs/images/project_panorama_zh_cn.png (+0 -0)
old_docs/images/spans_example.png (+0 -0)
old_docs/images/web_demo_1.png (+0 -0)
old_docs/output_file_en_us.md (+0 -0)
old_docs/output_file_zh_cn.md (+0 -0)
projects/README.md (+4 -0)
projects/README_zh-CN.md (+4 -0)
projects/gradio_app/README.md (+24 -0)
projects/gradio_app/README_zh-CN.md (+24 -0)
projects/gradio_app/app.py (+23 -18)
projects/gradio_app/examples/academic_paper_formula.pdf (+0 -0)
projects/gradio_app/examples/academic_paper_img_formula.pdf (+0 -0)
projects/gradio_app/examples/garbled_formula.pdf (+0 -0)
projects/gradio_app/examples/garbled_formula2.pdf (+0 -0)
projects/gradio_app/examples/garbled_img_formula.pdf (+0 -0)
projects/gradio_app/examples/scanned.pdf (+0 -0)
projects/gradio_app/header.html (+119 -0)
docs/images/layout_example.png → old_docs/images/layout_example.png (file moved)
docs/images/poly.png → old_docs/images/poly.png (file moved)
docs/images/project_panorama_en.png → old_docs/images/project_panorama_en.png (file moved)
docs/images/project_panorama_zh_cn.png → old_docs/images/project_panorama_zh_cn.png (file moved)
docs/images/spans_example.png → old_docs/images/spans_example.png (file moved)
old_docs/images/web_demo_1.png (0 → 100644, 498 KB)
docs/output_file_en_us.md → old_docs/output_file_en_us.md (file moved)
docs/output_file_zh_cn.md → old_docs/output_file_zh_cn.md (file moved)
projects/README.md
@@ -3,4 +3,8 @@
## Project List
- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
- [web_demo](./web_demo/README.md): MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) localized deployment version
- [web_api](./web_api/README.md): Web API Based on FastAPI
projects/README_zh-CN.md
@@ -3,3 +3,7 @@
## Project List
- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README_zh-CN.md): Web app based on Gradio
- [web_demo](./web_demo/README_zh-CN.md): Locally deployed version of the MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)
- [web_api](./web_api/README.md): Web API based on FastAPI
projects/gradio_app/README.md (0 → 100644)
## Installation

MinerU(>=0.8.0)
> If you already have a functioning MinerU environment, you can skip this step.
>
[Deploy in CPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#quick-cpu-demo)
[Deploy in GPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#using-gpu)

Third-party Software

```bash
pip install gradio gradio-pdf
```

## Start Gradio App

```bash
python app.py
```

## Use Gradio App

Access http://127.0.0.1:7860 in your web browser
\ No newline at end of file
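
Since the `app.py` change in this commit drops the automatic `os.system("pip install ...")` calls, the two packages listed above need to be present before `python app.py` is run. A minimal sketch of a pre-flight check follows. The helper name `missing_packages` is hypothetical and not part of MinerU; the import names `gradio` and `gradio_pdf` are taken from the imports visible in `app.py`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed from app.py: `import gradio` and `from gradio_pdf import PDF`.
    missing = missing_packages(["gradio", "gradio_pdf"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("Gradio dependencies found; run `python app.py`.")
```

Running this before starting the app gives a clear message instead of an `ImportError` at startup.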
projects/gradio_app/README_zh-CN.md (0 → 100644)
## Installation

MinerU(>=0.8.0)
> If you already have a working MinerU environment, you can skip this step.
>
[Deploy in CPU environment](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8cpu%E5%BF%AB%E9%80%9F%E4%BD%93%E9%AA%8C)
[Deploy in GPU environment](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8gpu)

Third-party software

```bash
pip install gradio gradio-pdf
```

## Start the Gradio app

```bash
python app.py
```

## Use the Gradio app

Open http://127.0.0.1:7860 in your browser
\ No newline at end of file
app.py → projects/gradio_app/app.py
@@ -14,8 +14,6 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
 from magic_pdf.tools.common import do_parse, prepare_env
-os.system("pip install gradio")
-os.system("pip install gradio-pdf")
 import gradio as gr
 from gradio_pdf import PDF
@@ -25,13 +23,16 @@ def read_fn(path):
     return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
-def parse_pdf(doc_path, output_dir, end_page_id):
+def parse_pdf(doc_path, output_dir, end_page_id, is_ocr):
     os.makedirs(output_dir, exist_ok=True)
     try:
         file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
         pdf_data = read_fn(doc_path)
-        parse_method = "auto"
+        if is_ocr:
+            parse_method = "ocr"
+        else:
+            parse_method = "auto"
         local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
         do_parse(
             output_dir,
@@ -92,9 +93,9 @@ def replace_image_with_base64(markdown_text, image_dir_path):
     return re.sub(pattern, replace, markdown_text)
-def to_markdown(file_path, end_pages):
+def to_markdown(file_path, end_pages, is_ocr):
     # Get the paths of the recognized md file and the zip archive
-    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
+    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr)
     archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
     zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
     if zip_archive_success == 0:
@@ -111,14 +112,6 @@ def to_markdown(file_path, end_pages):
     return md_content, txt_content, archive_zip_path, new_pdf_path
-# def show_pdf(file_path):
-#     with open(file_path, "rb") as f:
-#         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
-#     pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
-#                   f'width="100%" height="1000" type="application/pdf">'
-#     return pdf_display
 latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
                     {"left": '$', "right": '$', "display": False}]
@@ -141,16 +134,29 @@ model_init = init_model()
 logger.info(f"model_init: {model_init}")
+with open("header.html", "r") as file:
+    header = file.read()
 if __name__ == "__main__":
     with gr.Blocks() as demo:
+        gr.HTML(header)
         with gr.Row():
             with gr.Column(variant='panel', scale=5):
                 pdf_show = gr.Markdown()
                 max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
                 with gr.Row() as bu_flow:
+                    is_ocr = gr.Checkbox(label="Force enable OCR")
                     change_bu = gr.Button("Convert")
                     clear_bu = gr.ClearButton([pdf_show], value="Clear")
                 pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
+                with gr.Accordion("Examples:"):
+                    example_root = os.path.join(os.path.dirname(__file__), "examples")
+                    gr.Examples(
+                        examples=[os.path.join(example_root, _) for _ in os.listdir(example_root) if _.endswith("pdf")],
+                        inputs=pdf_show,
+                    )
             with gr.Column(variant='panel', scale=5):
                 output_file = gr.File(label="convert result", interactive=False)
@@ -160,8 +166,7 @@ if __name__ == "__main__":
                                     latex_delimiters=latex_delimiters, line_breaks=True)
                 with gr.Tab("Markdown text"):
                     md_text = gr.TextArea(lines=45, show_copy_button=True)
-        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
-        clear_bu.add([md, pdf_show, md_text, output_file])
-    demo.launch()
+        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages, is_ocr], outputs=[md, md_text, output_file, pdf_show])
+        clear_bu.add([md, pdf_show, md_text, output_file, is_ocr])
+    demo.launch()
\ No newline at end of file
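
The net effect of the new `is_ocr` argument is to pick the `parse_method` string that `parse_pdf` hands to `prepare_env`. A minimal standalone sketch of that selection is below, using only the two values that appear in this diff (`"ocr"` and `"auto"`). The function name `choose_parse_method` is hypothetical and does not exist in `app.py`; it is only meant to make the branch easy to test in isolation.

```python
def choose_parse_method(is_ocr: bool) -> str:
    """Map the "Force enable OCR" checkbox state to a parse method string.

    Only the two values visible in the diff are used: "ocr" when the box is
    checked, "auto" otherwise (including a falsy value such as None).
    """
    return "ocr" if is_ocr else "auto"


if __name__ == "__main__":
    for checked in (True, False):
        print(checked, "->", choose_parse_method(checked))
```

Keeping the selection in one small function would also remove the need for the four-line `if`/`else` inside `parse_pdf`, though that is a style choice rather than a behavior change.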
projects/gradio_app/examples/academic_paper_formula.pdf (0 → 100755, file added)
projects/gradio_app/examples/academic_paper_img_formula.pdf (0 → 100755, file added)
projects/gradio_app/examples/garbled_formula.pdf (0 → 100755, file added)
projects/gradio_app/examples/garbled_formula2.pdf (0 → 100755, file added)
projects/gradio_app/examples/garbled_img_formula.pdf (0 → 100755, file added)
projects/gradio_app/examples/scanned.pdf (0 → 100755, file added)
projects/gradio_app/header.html (0 → 100644)
<html><head>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.15.4/css/all.css">
<style>
    .link-block {
        border: 1px solid transparent;
        border-radius: 24px;
        background-color: rgba(54, 54, 54, 1);
        cursor: pointer !important;
    }
    .link-block:hover {
        background-color: rgba(54, 54, 54, 0.75) !important;
        cursor: pointer !important;
    }
    .external-link {
        display: inline-flex;
        align-items: center;
        height: 36px;
        line-height: 36px;
        padding: 0 16px;
        cursor: pointer !important;
    }
    .external-link,
    .external-link:hover {
        cursor: pointer !important;
    }
    a {
        text-decoration: none;
    }
</style></head>
<body>
<div style="
    display: flex;
    flex-direction: column;
    justify-content: center;
    align-items: center;
    text-align: center;
    background: linear-gradient(45deg, #007bff 0%, #0056b3 100%);
    padding: 24px;
    gap: 24px;
    border-radius: 8px;
  ">
  <div style="
      display: flex;
      flex-direction: column;
      align-items: center;
      gap: 16px;
    ">
    <div style="display: flex; flex-direction: column; gap: 8px">
      <h1 style="
          font-size: 48px;
          color: #fafafa;
          margin: 0;
          font-family: 'Trebuchet MS', 'Lucida Sans Unicode',
            'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
        ">
        MinerU: PDF Extraction Demo
      </h1>
    </div>
  </div>
  <p style="
      margin: 0;
      line-height: 1.6rem;
      font-size: 16px;
      color: #fafafa;
      opacity: 0.8;
    ">
    A one-stop, open-source, high-quality data extraction tool, supports
    PDF/webpage/e-book extraction.<br>
  </p>
  <style>
    .link-block {
      display: inline-block;
    }
    .link-block + .link-block {
      margin-left: 20px;
    }
  </style>
  <div class="column has-text-centered">
    <div class="publication-links">
      <!-- Code Link. -->
      <span class="link-block">
        <a href="https://github.com/opendatalab/MinerU"
           class="external-link button is-normal is-rounded is-dark"
           style="text-decoration: none; cursor: pointer">
          <span class="icon" style="margin-right: 4px">
            <i class="fab fa-github" style="color: white; margin-right: 4px"></i>
          </span>
          <span style="color: white">Code</span>
        </a>
      </span>
      <!-- arXiv Link. -->
      <span class="link-block">
        <a href="https://arxiv.org/abs/2409.18839"
           class="external-link button is-normal is-rounded is-dark"
           style="text-decoration: none; cursor: pointer">
          <span class="icon" style="margin-right: 8px">
            <i class="fas fa-file" style="color: white"></i>
          </span>
          <span style="color: white">Paper</span>
        </a>
      </span>
      <!-- Homepage Link. -->
      <span class="link-block">
        <a href="https://opendatalab.com/"
           class="external-link button is-normal is-rounded is-dark"
           style="text-decoration: none; cursor: pointer">
          <span class="icon" style="margin-right: 8px">
            <i class="fas fa-globe" style="color: white"></i>
          </span>
          <span style="color: white">Homepage</span>
        </a>
      </span>
    </div>
  </div>
  <!-- New Demo Links -->
</div>
</body></html>
\ No newline at end of file