wangsen / MinerU
Commit 85a4750d (Unverified)
Authored Jul 16, 2025 by Xiaomeng Zhao; committed by GitHub, Jul 16, 2025

Merge pull request #3026 from Sidney233/dev

Dev

Parents: 206ed770, a7e75dc0
Changes: 71
Showing 20 changed files with 80 additions and 471 deletions (+80 -471)
.github/workflows/cli.yml  +14 -10
.github/workflows/huigui.yml  +18 -15
mkdocs.yml  +3 -0
pyproject.toml  +45 -2
tests/retry_env.sh  +0 -28
tests/test_cli/conf/__init__py  +0 -0
tests/test_cli/conf/conf.py  +0 -10
tests/test_cli/conftest.py  +0 -10
tests/test_cli/lib/__init__.py  +0 -0
tests/test_cli/lib/calculate_score.py  +0 -116
tests/test_cli/lib/common.py  +0 -90
tests/test_cli/lib/pre_clean.py  +0 -128
tests/test_cli/lib/scoring.py  +0 -51
tests/test_cli/magic-pdf.json  +0 -9
tests/test_cli/pdf_dev/doc/test_mineru.docx  +0 -0
tests/test_cli/pdf_dev/images/docstructbench.jpg  +0 -0
tests/test_cli/pdf_dev/line1.jsonl  +0 -1
tests/test_cli/pdf_dev/pdf/test_rearch_report.pdf  +0 -0
tests/test_cli/pdf_dev/ppt/small.pptx  +0 -0
tests/test_cli/pdf_dev/result.json  +0 -1
.github/workflows/cli.yml @ 85a4750d
...
@@ -14,27 +14,31 @@ on:
 jobs:
   cli-test:
     if: github.repository == 'opendatalab/MinerU'
-    runs-on: pdf
+    runs-on: ubuntu-latest
     timeout-minutes: 240
     strategy:
       fail-fast: true
     steps:
     - name: PDF cli
-      uses: actions/checkout@v3
+      uses: actions/checkout@v4
       with:
+        ref: dev
         fetch-depth: 2
+    - name: install uv
+      uses: astral-sh/setup-uv@v5
     - name: install&test
       run: |
-        source activate mineru
-        conda env list
-        pip show coverage
-        cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
-        # cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
-        # cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
-        # cd $GITHUB_WORKSPACE && python tests/get_coverage.py
-        cd $GITHUB_WORKSPACE && pytest -m P0 -s -v tests/test_cli/test_cli_sdk.py
+        uv --version
+        uv venv --python 3.12
+        source .venv/bin/activate
+        uv pip install .[test]
+        cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
+        cd $GITHUB_WORKSPACE && coverage run
+        cd $GITHUB_WORKSPACE && python tests/get_coverage.py
   notify_to_feishu:
     if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure')}}
...
.github/workflows/huigui.yml @ 85a4750d
...
@@ -12,33 +12,36 @@ on:
       - "**.md"
 jobs:
   cli-test:
-    if: github.repository == 'opendatalab/MinerU'
+    # if: github.repository == 'opendatalab/MinerU'
-    runs-on: pdf
+    runs-on: ubuntu-latest
     timeout-minutes: 240
     strategy:
       fail-fast: true
     steps:
     - name: PDF cli
-      uses: actions/checkout@v3
+      uses: actions/checkout@v4
       with:
+        ref: dev
         fetch-depth: 2
+    - name: install uv
+      uses: astral-sh/setup-uv@v5
     - name: install&test
       run: |
-        source activate mineru
-        conda env list
-        pip show coverage
-        cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
-        # cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
-        # cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
-        # cd $GITHUB_WORKSPACE && python tests/get_coverage.py
-        cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
+        uv --version
+        uv venv --python 3.12
+        source .venv/bin/activate
+        uv pip install .[test]
+        cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
+        cd $GITHUB_WORKSPACE && coverage run
+        cd $GITHUB_WORKSPACE && python tests/get_coverage.py
   notify_to_feishu:
-    if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure')}}
+    # if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure')}}
     needs: cli-test
-    runs-on: pdf
+    runs-on: ubuntu-latest
     steps:
     - name: get_actor
       run: |
...
@@ -57,5 +60,5 @@ jobs:
     - name: notify
       run: |
-        # echo ${{ secrets.USER_ID }}
+        echo ${{ secrets.USER_ID }}
-        curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'$USER_ID'"}]]}}}}' $WEBHOOK_URL
+        curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"}]]}}}}' $WEBHOOK_URL
mkdocs.yml @ 85a4750d
...
@@ -100,6 +100,9 @@ plugins:
   - search
   - i18n:
       docs_structure: folder
+      fallback_to_default: true
+      reconfigure_material: true
+      reconfigure_search: true
       languages:
         - locale: en
           default: true
...
pyproject.toml @ 85a4750d
...
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "mineru"
 dynamic = ["version"]
 license = { text = "AGPL-3.0" }
 description = "A practical tool for converting PDF to Markdown"
 readme = "README.md"
 requires-python = ">=3.10,<3.14"
...
@@ -38,6 +38,14 @@ dependencies = [
 ]
 [project.optional-dependencies]
+test = [
+    "mineru[core]",
+    "pytest",
+    "pytest-cov",
+    "coverage",
+    "beautifulsoup4",
+    "fuzzywuzzy"
+]
 vlm = [
     "transformers>=4.51.1",
     "torch>=2.6.0",
...
@@ -112,7 +120,7 @@ mineru-api = "mineru.cli.fast_api:main"
 mineru-gradio = "mineru.cli.gradio_app:main"
 [tool.setuptools.dynamic]
 version = { attr = "mineru.version.__version__" }
 [tool.setuptools.packages.find]
 include = ["mineru*"]
...
@@ -125,3 +133,38 @@ namespaces = false
 [tool.setuptools]
 include-package-data = true
 zip-safe = false
+[tool.pytest.ini_options]
+addopts = "-s --cov=mineru --cov-report html"
+[tool.coverage.run]
+command_line = "-m pytest tests/unittest/test_e2e.py"
+source = ["mineru/"]
+omit = [
+    "*/vlm_sglang_model/*",
+    "*/gradio_app.py",
+    "*/models_download.py",
+    "*/fast_api.py",
+    "*/cli/client.py",
+    "*/sglang_engine_predictor.py",
+    "*/vlm_sglang_server.py",
+    "*/cli_parser.py",
+    "*/run_async.py"
+]
+[tool.coverage.html]
+directory = "htmlcov"
+[tool.coverage.report]
+exclude_also = [
+    'def __repr__',
+    'if self.debug:',
+    'if settings.DEBUG',
+    'raise AssertionError',
+    'raise NotImplementedError',
+    'if 0:',
+    'if __name__ == .__main__.:',
+    'if TYPE_CHECKING:',
+    'class .*\bProtocol\):',
+    '@(abc\.)?abstractmethod',
+]
\ No newline at end of file
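The `exclude_also` entries added under `[tool.coverage.report]` are regular expressions. As a rough illustration of what a few of them would match, here is a minimal sketch that assumes a source line is excluded when any pattern matches somewhere in it. `is_excluded` is a hypothetical helper written for this note, not part of coverage.py, and the real tool's handling may differ in detail.

```python
import re

# A subset of the patterns from the exclude_also list in this commit.
EXCLUDE_ALSO = [
    r'def __repr__',
    r'if __name__ == .__main__.:',
    r'class .*\bProtocol\):',
    r'@(abc\.)?abstractmethod',
]


def is_excluded(line: str) -> bool:
    """Return True if any pattern matches anywhere in the line (assumed semantics)."""
    return any(re.search(pattern, line) for pattern in EXCLUDE_ALSO)


# The '.' wildcards around __main__ accept either quote style.
print(is_excluded("if __name__ == '__main__':"))
print(is_excluded('if __name__ == "__main__":'))
print(is_excluded("x = 1"))
```

Using `.` instead of a literal quote character is what lets one pattern cover both single-quoted and double-quoted `__main__` guards.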
tests/retry_env.sh
deleted 100644 → 0 @ 206ed770
#!/bin/bash
max_retries=5
retry_count=0

while true; do
    # prepare env
    #python -m pip install -r requirements-qa.txt
    #python -m pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple
    pip install -e .
    python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
    pip install modelscope
    wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
    python download_models.py
    exit_code=$?
    if [ $exit_code -eq 0 ]; then
        echo "test.sh executed successfully!"
        break
    else
        let retry_count+=1
        if [ $retry_count -ge $max_retries ]; then
            echo "Reached the maximum number of retries ($max_retries); giving up."
            exit 1
        fi
        echo "test.sh failed (exit code: $exit_code). Retry attempt $retry_count..."
        sleep 5
    fi
done
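The deleted `tests/retry_env.sh` wraps the environment setup in a bounded retry loop: run the step, stop on exit code 0, otherwise count the failure, give up after `max_retries`, and sleep 5 seconds between attempts. The same control flow can be sketched in Python as follows. `run_with_retries` and its parameters are hypothetical names chosen for this illustration and do not appear in the repository.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], int], max_retries: int = 5, delay_s: float = 5.0) -> int:
    """Call `step` until it returns exit code 0.

    Returns the number of failed attempts that preceded the success.
    Raises RuntimeError once `max_retries` failures have been counted.
    """
    retry_count = 0
    while True:
        exit_code = step()
        if exit_code == 0:
            return retry_count
        retry_count += 1
        if retry_count >= max_retries:
            raise RuntimeError(
                f"gave up after {max_retries} failed attempts (last exit code: {exit_code})"
            )
        # Wait before the next attempt, as the script does with `sleep 5`.
        time.sleep(delay_s)


# Example: a step that fails twice and then succeeds.
codes = iter([1, 1, 0])
print(run_with_retries(lambda: next(codes), delay_s=0))
```

As in the shell script, the failure counter is checked before sleeping, so the last failed attempt exits immediately instead of waiting once more.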
tests/test_cli/conf/__init__py
deleted 100644 → 0 @ 206ed770
tests/test_cli/conf/conf.py
deleted 100644 → 0 @ 206ed770
import os
conf = {
"code_path": os.environ.get('GITHUB_WORKSPACE'),
"pdf_dev_path": os.environ.get('GITHUB_WORKSPACE') + "/tests/test_cli/pdf_dev",
#"code_path": "/home/quyuan/ci/actions-runner/MinerU",
#"pdf_dev_path": "/home/quyuan/ci/actions-runner/MinerU/tests/test_cli/pdf_dev",
"pdf_res_path": "/tmp/magic-pdf",
"jsonl_path": "s3://llm-qatest-pnorm/mineru/test/line1.jsonl",
"s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test_rearch_report.pdf"
}
tests/test_cli/conftest.py
deleted 100644 → 0 @ 206ed770
import pytest
import torch

def clear_gpu_memory():
    '''
    clear GPU memory
    '''
    torch.cuda.empty_cache()
    print("GPU memory cleared.")
tests/test_cli/lib/__init__.py
deleted 100644 → 0 @ 206ed770
tests/test_cli/lib/calculate_score.py
deleted 100644 → 0 @ 206ed770
"""
calculate_score
"""
import os
import re
import json
from Levenshtein import distance
from lib import scoring
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')


class Scoring:
    """
    calculate_score
    """
    def __init__(self, result_path):
        """
        init
        """
        self.edit_distances = []
        self.bleu_scores = []
        self.sim_scores = []
        self.filenames = []
        self.score_dict = {}
        self.anntion_cnt = 0
        self.fw = open(result_path, "w+", encoding='utf-8')

    def simple_bleu_score(self, candidate, reference):
        """
        get bleu score
        """
        candidate_tokens = word_tokenize(candidate)
        reference_tokens = word_tokenize(reference)
        return sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=SmoothingFunction().method1)

    def preprocess_string(self, s):
        """
        preprocess_string
        """
        sub_enter = re.sub(r'\n+', '\n', s)
        return re.sub(r' ', ' ', sub_enter)

    def calculate_similarity(self, annotion, actual, tool_type):
        """
        calculate_similarity
        """
        class_dict = {}
        edit_distances = []
        bleu_scores = []
        sim_scores = list()
        total_file = 0
        for filename in os.listdir(annotion):
            if filename.endswith('.md') and not filename.startswith('.'):
                total_file = total_file + 1
                with open(os.path.join(annotion, filename), 'r', encoding='utf-8') as file_a:
                    content_a = file_a.read()
                self.anntion_cnt = self.anntion_cnt + 1
                filepath_b = os.path.join(actual, filename)
                if os.path.exists(filepath_b):
                    with open(filepath_b, 'r', encoding='utf-8') as file_b:
                        content_b = file_b.read()
                        self.filenames.append(filename)
                        edit_dist = distance(self.preprocess_string(content_b), self.preprocess_string(content_a)) / max(len(content_a), len(content_b))
                        self.edit_distances.append(edit_dist)
                        edit_distances.append(edit_dist)
                        bleu_score = self.simple_bleu_score(content_b, content_a)
                        bleu_scores.append(bleu_score)
                        self.bleu_scores.append(bleu_score)
                        score = scoring.score_text(content_b, content_a)
                        sim_scores.append(score)
                        self.sim_scores.append(score)
                        class_dict[filename] = {"edit_dist": edit_dist, "bleu_score": bleu_score, "sim_score": score}
                        self.score_dict[filename] = {"edit_dist": edit_dist, "bleu_score": bleu_score, "sim_score": score}
                else:
                    print(f"File {filename} not found in actual directory.")
        class_average_edit_distance = sum(edit_distances) / len(edit_distances) if edit_distances else 0
        class_average_bleu_score = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
        class_average_sim_score = sum(sim_scores) / len(sim_scores) if sim_scores else 0
        self.fw.write(json.dumps(class_dict, ensure_ascii=False) + "\n")
        ratio = len(class_dict) / total_file
        self.fw.write(f"{tool_type} extract ratio: {ratio}" + "\n")
        self.fw.write(f"{tool_type} Average Levenshtein Distance: {class_average_edit_distance}" + "\n")
        self.fw.write(f"{tool_type} Average BLEU Score: {class_average_bleu_score}" + "\n")
        self.fw.write(f"{tool_type} Average Sim Score: {class_average_sim_score}" + "\n")
        print(f"{tool_type} extract ratio: {ratio}")
        print(f"{tool_type} Average Levenshtein Distance: {class_average_edit_distance}")
        print(f"{tool_type} Average BLEU Score: {class_average_bleu_score}")
        print(f"{tool_type} Average Sim Score: {class_average_sim_score}")
        return self.score_dict

    def summary_scores(self):
        """
        calculate the average of edit distance, bleu score and sim score
        """
        over_all_dict = dict()
        average_edit_distance = sum(self.edit_distances) / len(self.edit_distances) if self.edit_distances else 0
        average_bleu_score = sum(self.bleu_scores) / len(self.bleu_scores) if self.bleu_scores else 0
        average_sim_score = sum(self.sim_scores) / len(self.sim_scores) if self.sim_scores else 0
        over_all_dict["average_edit_distance"] = average_edit_distance
        over_all_dict["average_bleu_score"] = average_bleu_score
        over_all_dict["average_sim_score"] = average_sim_score
        self.fw.write(json.dumps(over_all_dict, ensure_ascii=False) + "\n")
        return over_all_dict

    def calculate_similarity_total(self, tool_type, download_dir):
        """
        calculate the average of edit distance, bleu score and sim score
        """
        annotion = os.path.join(download_dir, "annotations", "cleaned")
        actual = os.path.join(download_dir, tool_type, "cleaned")
        score = self.calculate_similarity(annotion, actual, tool_type)
        return score
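The `edit_dist` value in the deleted `calculate_score.py` is a Levenshtein distance divided by the length of the longer text. The sketch below shows that normalization with a plain dynamic-programming distance standing in for the `Levenshtein` package, so it runs without third-party dependencies. The function names are hypothetical, and unlike the deleted code it applies no preprocessing before measuring.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_edit_distance(actual: str, expected: str) -> float:
    """Distance scaled to [0, 1] by the longer input; 0.0 means identical."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 0.0  # both empty: avoid dividing by zero
    return levenshtein(actual, expected) / longest


print(normalized_edit_distance("abcd", "abcf"))
```

The explicit empty-input guard is an addition of this sketch; the deleted code divides by `max(len(content_a), len(content_b))` without such a check.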
tests/test_cli/lib/common.py
deleted 100644 → 0 @ 206ed770
"""common definitions."""
import os
import shutil
import re
import json
import torch

def clear_gpu_memory():
    '''
    clear GPU memory
    '''
    torch.cuda.empty_cache()
    print("GPU memory cleared.")

def check_shell(cmd):
    """shell successful."""
    res = os.system(cmd)
    assert res == 0

def update_config_file(file_path, key, value):
    """update config file."""
    with open(file_path, 'r', encoding="utf-8") as fr:
        config = json.loads(fr.read())
    config[key] = value
    # save the modified content
    with open(file_path, 'w', encoding='utf-8') as fw:
        json.dump(config, fw, ensure_ascii=False, indent=4)

def cli_count_folders_and_check_contents(file_path):
    """" count cli files."""
    if os.path.exists(file_path):
        for files in os.listdir(file_path):
            folder_count = os.path.getsize(os.path.join(file_path, files))
            assert folder_count > 0
    assert len(os.listdir(file_path)) > 5

def sdk_count_folders_and_check_contents(file_path):
    """count folders."""
    if os.path.exists(file_path):
        file_count = os.path.getsize(file_path)
        assert file_count > 0
    else:
        exit(1)

def delete_file(path):
    """delete file."""
    if not os.path.exists(path):
        if os.path.isfile(path):
            try:
                os.remove(path)
                print(f"File '{path}' deleted.")
            except TypeError as e:
                print(f"Error deleting file '{path}': {e}")
        elif os.path.isdir(path):
            try:
                shutil.rmtree(path)
                print(f"Directory '{path}' and its contents deleted.")
            except TypeError as e:
                print(f"Error deleting directory '{path}': {e}")

def check_latex_table_exists(file_path):
    """check latex table exists."""
    pattern = r'\\begin\{tabular\}.*?\\end\{tabular\}'
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    matches = re.findall(pattern, content, re.DOTALL)
    return len(matches) > 0

def check_html_table_exists(file_path):
    """check html table exists."""
    pattern = r'<table.*?>.*?</table>'
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    matches = re.findall(pattern, content, re.DOTALL)
    return len(matches) > 0

def check_close_tables(file_path):
    """delete no tables."""
    latex_pattern = r'\\begin\{tabular\}.*?\\end\{tabular\}'
    html_pattern = r'<table.*?>.*?</table>'
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    latex_matches = re.findall(latex_pattern, content, re.DOTALL)
    html_matches = re.findall(html_pattern, content, re.DOTALL)
    if len(latex_matches) == 0 and len(html_matches) == 0:
        return True
    else:
        return False
\ No newline at end of file
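The table checks in the deleted `common.py` rely on two non-greedy patterns compiled with `re.DOTALL`, so a table that spans several lines still counts as one match. A minimal sketch of the same detection, applied to in-memory strings instead of files, might look like this. `has_table` is a hypothetical helper, not a function from the repository.

```python
import re

# Same patterns as in the deleted helpers; DOTALL lets '.' cross newlines.
LATEX_TABLE = re.compile(r'\\begin\{tabular\}.*?\\end\{tabular\}', re.DOTALL)
HTML_TABLE = re.compile(r'<table.*?>.*?</table>', re.DOTALL)


def has_table(text: str) -> bool:
    """Return True if the text contains a LaTeX tabular or an HTML table."""
    return bool(LATEX_TABLE.search(text) or HTML_TABLE.search(text))


html_sample = "<table border='1'>\n<tr><td>1</td></tr>\n</table>"
latex_sample = "\\begin{tabular}{cl}\na & b\n\\end{tabular}"
print(has_table(html_sample), has_table(latex_sample), has_table("plain text"))
```

Without `re.DOTALL` neither sample would match, because both tables contain line breaks between their opening and closing markers.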
tests/test_cli/lib/pre_clean.py
deleted 100644 → 0 @ 206ed770
"""
clean data
"""
import argparse
import os
import re
import htmltabletomd # type: ignore
import pypandoc
import argparse

parser = argparse.ArgumentParser(description="get tool type")
parser.add_argument(
    "--tool_name",
    type=str,
    required=True,
    help="input tool name",
)
parser.add_argument(
    "--download_dir",
    type=str,
    required=True,
    help="input download dir",
)
args = parser.parse_args()

def clean_markdown_images(content):
    """
    clean markdown images
    """
    pattern = re.compile(r'!\[[^\]]*\]\([^)]*\)', re.IGNORECASE)
    cleaned_content = pattern.sub('', content)
    return cleaned_content

def clean_ocrmath_photo(content):
    """
    clean ocrmath photo
    """
    pattern = re.compile(r'\\includegraphics\[.*?\]\{.*?\}', re.IGNORECASE)
    cleaned_content = pattern.sub('', content)
    return cleaned_content

def convert_html_table_to_md(html_table):
    """
    convert html table to markdown table
    """
    lines = html_table.strip().split('\n')
    md_table = ''
    if lines and '<tr>' in lines[0]:
        in_thead = True
        for line in lines:
            if '<th>' in line:
                cells = re.findall(r'<th>(.*?)</th>', line)
                md_table += '| ' + ' | '.join(cells) + ' |\n'
                in_thead = False
            elif '<td>' in line and not in_thead:
                cells = re.findall(r'<td>(.*?)</td>', line)
                md_table += '| ' + ' | '.join(cells) + ' |\n'
        md_table = md_table.rstrip() + '\n'
    return md_table

def convert_latext_to_md(content):
    """
    convert latex table to markdown table
    """
    tables = re.findall(r'\\begin\{tabular\}(.*?)\\end\{tabular\}', content, re.DOTALL)
    placeholders = []
    for table in tables:
        placeholder = f"<!-- TABLE_PLACEHOLDER_{len(placeholders)} -->"
        replace_str = f"\\begin{{tabular}}{table}cl\\end{{tabular}}"
        content = content.replace(replace_str, placeholder)
        try:
            pypandoc.convert_text(replace_str, format="latex", to="md", outputfile="output.md", encoding="utf-8")
        except:
            markdown_string = replace_str
        else:
            markdown_string = open('output.md', 'r', encoding='utf-8').read()
        placeholders.append((placeholder, markdown_string))
    new_content = content
    for placeholder, md_table in placeholders:
        new_content = new_content.replace(placeholder, md_table)
        # write to file
    return new_content

def convert_htmltale_to_md(content):
    """
    convert html table to markdown table
    """
    tables = re.findall(r'<table>(.*?)</table>', content, re.DOTALL)
    placeholders = []
    for table in tables:
        placeholder = f"<!-- TABLE_PLACEHOLDER_{len(placeholders)} -->"
        content = content.replace(f"<table>{table}</table>", placeholder)
        try:
            convert_table = htmltabletomd.convert_table(table)
        except:
            convert_table = table
        placeholders.append((placeholder, convert_table))
    new_content = content
    for placeholder, md_table in placeholders:
        new_content = new_content.replace(placeholder, md_table)
        # write to file
    return new_content

def clean_data(prod_type, download_dir):
    """
    clean data
    """
    tgt_dir = os.path.join(download_dir, prod_type, "cleaned")
    if not os.path.exists(tgt_dir):
        os.makedirs(tgt_dir)
    source_dir = os.path.join(download_dir, prod_type)
    filenames = os.listdir(source_dir)
    for filename in filenames:
        if filename.endswith('.md'):
            input_file = os.path.join(source_dir, filename)
            output_file = os.path.join(tgt_dir, "cleaned_" + filename)
            with open(input_file, 'r', encoding='utf-8') as fr:
                content = fr.read()
                new_content = clean_markdown_images(content)
                with open(output_file, 'w', encoding='utf-8') as fw:
                    fw.write(new_content)

if __name__ == '__main__':
    tool_type = args.tool_name
    download_dir = args.download_dir
    clean_data(tool_type, download_dir)
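Of the helpers in the deleted `pre_clean.py`, only `clean_markdown_images` is called by `clean_data`. Its pattern removes Markdown image syntax (`![alt](target)`) while leaving ordinary links alone, because the match must start with `!`. A small self-contained sketch of that behaviour, with a sample string invented for illustration:

```python
import re

# Same pattern as in the deleted helper: '!' + '[alt text]' + '(target)'.
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\([^)]*\)')


def clean_markdown_images(content: str) -> str:
    """Strip Markdown image references from the text."""
    return IMAGE_PATTERN.sub('', content)


sample = "Intro ![fig](images/a.png) text [link](https://example.com)"
cleaned = clean_markdown_images(sample)
print(cleaned)
```

The spaces around a removed image are left in place, so the cleaned text can contain doubled spaces where an image used to be.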
tests/test_cli/lib/scoring.py
deleted 100644 → 0 @ 206ed770
"""
Calculate simscore, refer to (https://github.com/VikParuchuri/marker?tab=readme-ov-file)
"""
import math

from rapidfuzz import fuzz
import re
import regex
from statistics import mean

CHUNK_MIN_CHARS = 25

def chunk_text(text, chunk_len=500):
    chunks = [text[i:i+chunk_len] for i in range(0, len(text), chunk_len)]
    chunks = [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]
    return chunks


def overlap_score(hypothesis_chunks, reference_chunks):
    if len(reference_chunks) > 0:
        length_modifier = len(hypothesis_chunks) / len(reference_chunks)
    else:
        length_modifier = 0
    search_distance = max(len(reference_chunks) // 5, 10)
    chunk_scores = []
    for i, hyp_chunk in enumerate(hypothesis_chunks):
        max_score = 0
        total_len = 0
        i_offset = int(i * length_modifier)
        chunk_range = range(max(0, i_offset-search_distance), min(len(reference_chunks), i_offset+search_distance))
        for j in chunk_range:
            ref_chunk = reference_chunks[j]
            score = fuzz.ratio(hyp_chunk, ref_chunk, score_cutoff=30) / 100
            if score > max_score:
                max_score = score
                total_len = len(ref_chunk)
        chunk_scores.append(max_score)
    return chunk_scores


def score_text(hypothesis, reference):
    # Returns a 0-1 alignment score
    hypothesis_chunks = chunk_text(hypothesis)
    reference_chunks = chunk_text(reference)
    chunk_scores = overlap_score(hypothesis_chunks, reference_chunks)
    if len(chunk_scores) > 0:
        mean_score = mean(chunk_scores)
        return mean_score
    else:
        return 0
    #return mean(chunk_scores)
\ No newline at end of file
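The deleted `scoring.py` scores two texts by cutting each into 500-character chunks, comparing every hypothesis chunk against a window of nearby reference chunks, keeping the best match per chunk, and averaging. The sketch below reproduces that flow with the standard library's `difflib.SequenceMatcher` swapped in for `rapidfuzz.fuzz.ratio`, purely so the example runs without extra packages. The two similarity measures are not equivalent, there is no `score_cutoff` here, and the `similarity` helper is a hypothetical name.

```python
from difflib import SequenceMatcher
from statistics import mean

CHUNK_MIN_CHARS = 25


def chunk_text(text, chunk_len=500):
    """Split into fixed-size chunks, dropping blank or very short ones."""
    chunks = [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]
    return [c for c in chunks if c.strip() and len(c) > CHUNK_MIN_CHARS]


def similarity(a, b):
    # Stand-in for fuzz.ratio(a, b) / 100. autojunk is disabled because the
    # default heuristic ignores frequent characters in long strings.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def score_text(hypothesis, reference):
    """Return a 0-1 alignment score (mean of the best match per chunk)."""
    hyp_chunks = chunk_text(hypothesis)
    ref_chunks = chunk_text(reference)
    if not hyp_chunks or not ref_chunks:
        return 0
    length_modifier = len(hyp_chunks) / len(ref_chunks)
    search_distance = max(len(ref_chunks) // 5, 10)
    scores = []
    for i, hyp in enumerate(hyp_chunks):
        offset = int(i * length_modifier)
        window = range(max(0, offset - search_distance),
                       min(len(ref_chunks), offset + search_distance))
        scores.append(max((similarity(hyp, ref_chunks[j]) for j in window), default=0))
    return mean(scores)


text = "The quick brown fox jumps over the lazy dog. " * 30
print(score_text(text, text))
```

Restricting each comparison to a window around the expected position keeps the number of chunk comparisons roughly linear in document length, at the cost of missing content that has moved far from where it sits in the reference.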
tests/test_cli/magic-pdf.json
deleted 100644 → 0 @ 206ed770
{
    "bucket_info":{
        "bucket-name-1":["ak", "sk", "endpoint"],
        "bucket-name-2":["ak", "sk", "endpoint"]
    },
    "temp-output-dir":"/tmp",
    "models-dir":"/tmp/models",
    "device-mode":"cpu"
}
\ No newline at end of file
tests/test_cli/pdf_dev/doc/test_mineru.docx
deleted 100644 → 0 @ 206ed770
File deleted

tests/test_cli/pdf_dev/images/docstructbench.jpg
deleted 100644 → 0 @ 206ed770
541 KB

tests/test_cli/pdf_dev/line1.jsonl
deleted 100644 → 0 @ 206ed770
This diff is collapsed.

tests/test_cli/pdf_dev/pdf/test_rearch_report.pdf
deleted 100644 → 0 @ 206ed770
File deleted

tests/test_cli/pdf_dev/ppt/small.pptx
deleted 100644 → 0 @ 206ed770
File deleted
tests/test_cli/pdf_dev/result.json
deleted 100644 → 0 @ 206ed770
{"average_sim_score": 0.6505598645664856, "average_edit_distance": 0.2514908429188901, "average_bleu_score": 0.5808819533975296}
\ No newline at end of file