Unverified Commit 9d689790 authored by linfeng's avatar linfeng Committed by GitHub
Browse files

Merge branch 'opendatalab:dev' into dev

parents bcef0868 fb383ba6
<html><head>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.15.4/css/all.css">
<style>
.link-block {
border: 1px solid transparent;
border-radius: 24px;
background-color: rgba(54, 54, 54, 1);
cursor: pointer !important;
}
.link-block:hover {
background-color: rgba(54, 54, 54, 0.75) !important;
cursor: pointer !important;
}
.external-link {
display: inline-flex;
align-items: center;
height: 36px;
line-height: 36px;
padding: 0 16px;
cursor: pointer !important;
}
.external-link,
.external-link:hover {
cursor: pointer !important;
}
a {
text-decoration: none;
}
</style></head>
<body>
<div style="
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
text-align: center;
background: linear-gradient(45deg, #007bff 0%, #0056b3 100%);
padding: 24px;
gap: 24px;
border-radius: 8px;
">
<div style="
display: flex;
flex-direction: column;
align-items: center;
gap: 16px;
">
<div style="display: flex; flex-direction: column; gap: 8px">
<h1 style="
font-size: 48px;
color: #fafafa;
margin: 0;
font-family: 'Trebuchet MS', 'Lucida Sans Unicode',
'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
">
MinerU: PDF Extraction Demo
</h1>
</div>
</div>
<p style="
margin: 0;
line-height: 1.6rem;
font-size: 16px;
color: #fafafa;
opacity: 0.8;
">
A one-stop, open-source, high-quality data extraction tool, supports
PDF/webpage/e-book extraction.<br>
</p>
<style>
.link-block {
display: inline-block;
}
.link-block + .link-block {
margin-left: 20px;
}
</style>
<div class="column has-text-centered">
<div class="publication-links">
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/opendatalab/MinerU" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 4px">
<i class="fab fa-github" style="color: white; margin-right: 4px"></i>
</span>
<span style="color: white">Code</span>
</a>
</span>
<!-- Homepage Link. -->
<span class="link-block">
<a href="https://opendatalab.com/" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 8px">
<i class="fas fa-globe" style="color: white"></i>
</span>
<span style="color: white">Homepage</span>
</a>
</span>
</div>
</div>
<!-- New Demo Links -->
</div>
</body></html>
\ No newline at end of file
magic-pdf[full]>=0.8.0
gradio
gradio-pdf
\ No newline at end of file
## 安装 <details open="open">
<summary><h2 style="display: inline-block">目录</h2></summary>
<li><a href="#介绍">介绍</a></li>
<li><a href="#安装">安装</a></li>
<li><a href="#示例">示例</a></li>
<li><a href="#开发">开发</a></li>
</ol>
</details>
MinerU ## 介绍
```bash `MinerU` 提供数据 `API接口` 以支持用户导入数据到 `RAG` 系统。本项目将基于`通义千问`展示如何构建一个轻量级的 `RAG` 系统。
git clone https://github.com/opendatalab/MinerU.git
cd MinerU <p align="center">
<img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
</p>
## 安装
conda create -n MinerU python=3.10 环境要求
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com ```text
NVIDIA A100 80GB,
Centos 7 3.10.0-957.el7.x86_64
Client: Docker Engine - Community
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:39:02 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:38:05 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.25
GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
runc:
Version: 1.1.10
GitCommit: v1.1.10-0-g18a0cb0
docker-init:
Version: 0.19.0
GitCommit: de40ad0
``` ```
请参考[文档](../../README_zh-CN.md) 安装 MinerU
第三方软件 第三方软件
```bash ```bash
# install # install
pip install modelscope==1.14.0
pip install llama-index-vector-stores-elasticsearch==0.2.0 pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0 pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68 pip install llama-index-core==0.10.68
...@@ -26,39 +70,12 @@ pip install accelerate==0.33.0 ...@@ -26,39 +70,12 @@ pip install accelerate==0.33.0
pip uninstall transformer-engine pip uninstall transformer-engine
``` ```
## 环境配置
```
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
## 使用
### 导入数据
```bash
python data_ingestion.py -p some.pdf # load data from pdf
or
python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}
```
### 查询
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## 示例 ## 示例
````bash ````bash
# 启动 es 服务 cd projects/llama_index_rag
docker compose up -d docker compose up -d
or or
...@@ -67,17 +84,41 @@ docker-compose up -d ...@@ -67,17 +84,41 @@ docker-compose up -d
# 配置环境变量 # 配置环境变量
export ES_USER=elastic export ES_USER=elastic
export ES_PASSWORD=llama_index export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200 export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key} export DASHSCOPE_API_KEY={some_key}
DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
# 未导入数据,查询问题。返回通义千问默认答案
python query.py -q 'how about the rights of men'
## outputs
question: how about the rights of men
answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
# 导入数据 # 导入数据
python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf python data_ingestion.py -p example/data/
or
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# 导入数据后,查询问题。通义千问模型会根据 RAG 系统的检索结果,结合上下文,给出答案。
# 查询问题
python query.py -q 'how about the rights of men' python query.py -q 'how about the rights of men'
## outputs ## outputs
......
...@@ -8,7 +8,7 @@ fast-langdetect==0.2.0 ...@@ -8,7 +8,7 @@ fast-langdetect==0.2.0
wordninja>=2.0.0 wordninja>=2.0.0
scikit-learn>=1.0.2 scikit-learn>=1.0.2
pdfminer.six==20231228 pdfminer.six==20231228
unimernet==0.1.6 unimernet==0.2.1
matplotlib matplotlib
ultralytics ultralytics
paddleocr==2.7.3 paddleocr==2.7.3
......
...@@ -36,7 +36,7 @@ if __name__ == '__main__': ...@@ -36,7 +36,7 @@ if __name__ == '__main__':
"paddlepaddle==3.0.0b1;platform_system=='Linux'", "paddlepaddle==3.0.0b1;platform_system=='Linux'",
"paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'", "paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
], ],
"full": ["unimernet==0.1.6", # 0.1.6版本大幅裁剪依赖包范围,推荐使用此版本 "full": ["unimernet==0.2.1", # unimernet升级0.2.1
"matplotlib<=3.9.0;platform_system=='Windows'", # 3.9.1及之后不提供windows的预编译包,避免一些没有编译环境的windows设备安装失败 "matplotlib<=3.9.0;platform_system=='Windows'", # 3.9.1及之后不提供windows的预编译包,避免一些没有编译环境的windows设备安装失败
"matplotlib;platform_system=='Linux' or platform_system=='Darwin'", # linux 和 macos 不应限制matplotlib的最高版本,以避免无法更新导致的一些bug "matplotlib;platform_system=='Linux' or platform_system=='Darwin'", # linux 和 macos 不应限制matplotlib的最高版本,以避免无法更新导致的一些bug
"ultralytics", # yolov8,公式检测 "ultralytics", # yolov8,公式检测
......
"""
clean coverage
"""
import os
import shutil
def delete_file(path):
"""delete file."""
if not os.path.exists(path):
if os.path.isfile(path):
try:
os.remove(path)
print(f"File '{path}' deleted.")
except TypeError as e:
print(f"Error deleting file '{path}': {e}")
elif os.path.isdir(path):
try:
shutil.rmtree(path)
print(f"Directory '{path}' and its contents deleted.")
except TypeError as e:
print(f"Error deleting directory '{path}': {e}")
if __name__ == "__main__":
delete_file("htmlcov")
\ No newline at end of file
...@@ -2,7 +2,7 @@ ...@@ -2,7 +2,7 @@
get cov get cov
""" """
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
import shutil
def get_covrage(): def get_covrage():
"""get covrage""" """get covrage"""
# 发送请求获取网页内容 # 发送请求获取网页内容
......
...@@ -8,7 +8,8 @@ while true; do ...@@ -8,7 +8,8 @@ while true; do
# prepare env # prepare env
source activate MinerU source activate MinerU
pip install -r requirements-qa.txt pip install -r requirements-qa.txt
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple pip uninstall magic-pdf
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
exit_code=$? exit_code=$?
if [ $exit_code -eq 0 ]; then if [ $exit_code -eq 0 ]; then
......
...@@ -2,6 +2,7 @@ import os ...@@ -2,6 +2,7 @@ import os
conf = { conf = {
"code_path": os.environ.get('GITHUB_WORKSPACE'), "code_path": os.environ.get('GITHUB_WORKSPACE'),
"pdf_dev_path" : os.environ.get('GITHUB_WORKSPACE') + "/tests/test_cli/pdf_dev", "pdf_dev_path" : os.environ.get('GITHUB_WORKSPACE') + "/tests/test_cli/pdf_dev",
"pdf_res_path": "/tmp/magic-pdf" "pdf_res_path": "/tmp/magic-pdf",
"jsonl_path": "s3://llm-qatest-pnorm/mineru/test/line1.jsonl",
"s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test.pdf"
} }
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
...@@ -9,7 +9,7 @@ from lib import common ...@@ -9,7 +9,7 @@ from lib import common
import magic_pdf.model as model_config import magic_pdf.model as model_config
from magic_pdf.pipe.UNIPipe import UNIPipe from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
model_config.__use_inside_model__ = True model_config.__use_inside_model__ = True
pdf_res_path = conf.conf['pdf_res_path'] pdf_res_path = conf.conf['pdf_res_path']
code_path = conf.conf['code_path'] code_path = conf.conf['code_path']
...@@ -178,6 +178,95 @@ class TestCli: ...@@ -178,6 +178,95 @@ class TestCli:
common.cli_count_folders_and_check_contents( common.cli_count_folders_and_check_contents(
os.path.join(res_path, demo_name, 'ocr')) os.path.join(res_path, demo_name, 'ocr'))
@pytest.mark.P1
def test_pdf_dev_cli_local_jsonl_txt(self):
"""magic_pdf_dev cli local txt."""
jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, "txt")
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_local_jsonl_ocr(self):
"""magic_pdf_dev cli local ocr."""
jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'ocr')
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_local_jsonl_auto(self):
"""magic_pdf_dev cli local auto."""
jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'auto')
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_s3_jsonl_txt(self):
"""magic_pdf_dev cli s3 txt."""
jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, "txt")
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_s3_jsonl_ocr(self):
"""magic_pdf_dev cli s3 ocr."""
jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'ocr')
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_s3_jsonl_auto(self):
"""magic_pdf_dev cli s3 auto."""
jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'auto')
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_pdf_json_auto(self):
"""magic_pdf_dev cli pdf+json auto."""
json_path = os.path.join(pdf_dev_path, 'test_model.json')
pdf_path = os.path.join(pdf_dev_path, 'pdf', 'research_report_1f978cd81fb7260c8f7644039ec2c054.pdf')
cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'auto')
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_pdf_dev_cli_pdf_json_ocr(self):
"""magic_pdf_dev cli pdf+json ocr."""
json_path = os.path.join(pdf_dev_path, 'test_model.json')
pdf_path = os.path.join(pdf_dev_path, 'pdf', 'research_report_1f978cd81fb7260c8f7644039ec2c054.pdf')
cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'auto')
logging.info(cmd)
os.system(cmd)
@pytest.mark.P1
def test_s3_sdk_suto(self):
pdf_ak = os.environ.get('pdf_ak', "")
pdf_sk = os.environ.get('pdf_sk', "")
pdf_bucket = os.environ.get('bucket', "")
pdf_endpoint = os.environ.get('pdf_endpoint', "")
s3_pdf_path = conf.conf["s3_pdf_path"]
image_dir = "s3://" + pdf_bucket + "/mineru/test/test.md"
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
s3image_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_analyze()
pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
assert len(md_content) > 0
if __name__ == '__main__': if __name__ == '__main__':
pytest.main() pytest.main()
"""
test performance
"""
import os
import shutil
import json
from lib import calculate_score
import pytest
from conf import conf
code_path = os.environ.get('GITHUB_WORKSPACE')
pdf_dev_path = conf.conf["pdf_dev_path"]
pdf_res_path = conf.conf["pdf_res_path"]
class TestTable():
"""
test table
"""
def test_perf_close_table(self):
"""
test perf when close table
"""
def get_score():
"""
get score
"""
score = calculate_score.Scoring(os.path.join(pdf_dev_path, "result.json"))
score.calculate_similarity_total("mineru", pdf_dev_path)
res = score.summary_scores()
return res
"""
test table case
"""
import os
import shutil
import json
from lib import calculate_score
import pytest
from conf import conf
code_path = os.environ.get('GITHUB_WORKSPACE')
pdf_dev_path = conf.conf["pdf_dev_path"]
pdf_res_path = conf.conf["pdf_res_path"]
class TestTable():
"""
test table
"""
def test_paddle_table_master_cuda(self):
"""
select table: paddle table master,mode is cuda
"""
def test_paddle_table_master_cpu(self):
"""
select table: paddle table master, mode is cpu
"""
def test_st_table_cuda(self):
"""
select table: ST, mode is cuda
"""
def test_st_table_cpu(self):
"""
select table: ST, mode is cpu
"""
def test_close_table_cuda(self):
"""
close table, mode is cuda
"""
def get_score():
"""
get score
"""
score = calculate_score.Scoring(os.path.join(pdf_dev_path, "result.json"))
score.calculate_similarity_total("mineru", pdf_dev_path)
res = score.summary_scores()
return res
import pytest
from PIL import Image
from magic_pdf.model.ppTableModel import ppTableModel
class TestppTableModel:
def test_image2html(self):
img = Image.open("tests/unittest/test_table/assets/table.jpg")
# 修改table模型路径
config = {"device": "cuda",
"model_dir": "/home/quyuan/PDF-Extract-Kit/models/TabRec/TableMaster"}
table_model = ppTableModel(config)
res = table_model.img2html(img)
true_value = """<td><table border="1"><thead><tr><td><b>Methods</b></td><td><b>R</b></td><td><b>P</b></td><td><b>F</b></td><td><b>FPS</b></td></tr></thead><tbody><tr><td>SegLink [26]</td><td>70.0</td><td>86.0</td><td>77.0</td><td>8.9</td></tr><tr><td>PixelLink [4]</td><td>73.2</td><td>83.0</td><td>77.8</td><td>-</td></tr><tr><td>TextSnake [18]</td><td>73.9</td><td>83.2</td><td>78.3</td><td>1.1</td></tr><tr><td>TextField [37]</td><td>75.9</td><td>87.4</td><td>81.3</td><td>5.2 </td></tr><tr><td>MSR[38]</td><td>76.7</td><td>87.4</td><td>81.7</td><td>-</td></tr><tr><td>FTSN[3]</td><td>77.1</td><td>87.6</td><td>82.0</td><td>-</td></tr><tr><td>LSE[30]</td><td>81.7</td><td>84.2</td><td>82.9</td><td>-</td></tr><tr><td>CRAFT [2]</td><td>78.2</td><td>88.2</td><td>82.9</td><td>8.6</td></tr><tr><td>MCN [16]</td><td>79</td><td>88.</td><td>83</td><td>-</td></tr><tr><td>ATRR[35]</td><td>82.1</td><td>85.2</td><td>83.6</td><td>-</td></tr><tr><td>PAN [34]</td><td>83.8</td><td>84.4</td><td>84.1</td><td>30.2</td></tr><tr><td>DB[12]</td><td>79.2</td><td>91.5</td><td>84.9</td><td>32.0</td></tr><tr><td>DRRG [41]</td><td>82.30</td><td>88.05</td><td>85.08</td><td>-</td></tr><tr><td>Ours (SynText)</td><td>80.68</td><td>85.40</td><td>82.97</td><td>12.68</td></tr><tr><td>Ours (MLT-17)</td><td>84.54</td><td>86.62</td><td>85.57</td><td>12.31</td></tr></tbody></table></td>\n"""
assert res == true_value
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment