Merge pull request #2625 from opendatalab/release-2.0.0

Release 2.0.0
Installation
==============
.. toctree::
:maxdepth: 1
:caption: Installation
install/install
install/boost_with_cuda
install/download_model_weight_files
Accelerating with CUDA
=======================
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Choose the guide that matches your system:
- :ref:`ubuntu_22_04_lts_section`
- :ref:`windows_10_or_11_section`
- Quick deployment with Docker
.. admonition:: Important
:class: tip
Docker requires a GPU with at least 6 GB of VRAM, and all acceleration features are enabled by default.
Before running this Docker container, you can use the following command to check whether your device supports CUDA acceleration under Docker.
.. code-block:: sh
docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
.. code:: sh
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile -O Dockerfile
docker build -t mineru:latest .
docker run -it --name mineru --gpus=all mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
magic-pdf --help
.. _ubuntu_22_04_lts_section:
Ubuntu 22.04 LTS
----------------
1. Check whether the NVIDIA driver is installed
-----------------------------------------------
.. code:: bash
nvidia-smi
If you see output similar to the following, the NVIDIA driver is already installed and you can skip step 2.
.. admonition:: Important
:class: tip
The version shown as ``CUDA Version`` should be >= 12.4; if it is lower than 12.4, please upgrade the driver.
.. code:: text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 572.83 CUDA Version: 12.8 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 0% 51C P8 12W / 200W | 1489MiB / 8192MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
2. Install the driver
---------------------
If no driver is installed, install the proprietary driver with the following commands:
.. code:: bash
sudo apt-get update
sudo apt-get install nvidia-driver-570-server
After the installation finishes, reboot the machine:
.. code:: bash
reboot
3. Install Anaconda
-------------------
If conda is already installed, you can skip this step.
.. code:: bash
wget -U NoSuchBrowser/1.0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
Enter ``yes`` at the final prompt, then close and reopen the terminal.
4. Create an environment with conda
-----------------------------------
.. code:: bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
5. Install the application
---------------------------
.. code:: bash
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
.. admonition:: Important
:class: tip
After the download finishes, be sure to verify that the magic-pdf version is correct with the following command:
.. code:: bash
magic-pdf --version
If the version is lower than 1.3.0, please report it to us in an issue.
6. Download the models
-----------------------
See :doc:`download_model_weight_files` for details.
7. Locate the configuration file
---------------------------------
After completing step \ `6. Download the models <#6-下载模型>`__\ , the script automatically generates a magic-pdf.json file in the user directory and configures the default model paths. You can find magic-pdf.json in your user directory.
.. admonition:: Tip
:class: tip
On Linux the user directory is "/home/<username>".
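For reference, a minimal magic-pdf.json looks roughly like the following. This is an illustrative sketch: the exact keys can vary between versions, and the download script from step 6 fills in the real model paths, so the placeholder values below are not meant to be copied literally.
.. code:: json

    {
        "models-dir": "<path-to-downloaded-models>",
        "layoutreader-model-dir": "<path-to-layoutreader-model>",
        "device-mode": "cpu"
    }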
8. First run
------------
Download a sample file from the repository and test it:
.. code:: bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/pdfs/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
9. Test CUDA acceleration
--------------------------
If your GPU has **8 GB** of VRAM or more, you can follow the steps below to test CUDA-accelerated parsing.
**1. Change the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
.. code:: json
{
"device-mode":"cuda"
}
**2. Run the following command to test CUDA acceleration**
.. code:: bash
magic-pdf -p small_ocr.pdf -o ./output
.. admonition:: Tip
:class: tip
You can roughly judge whether CUDA acceleration is working from the per-stage timings printed in the log; CUDA should normally be faster than the CPU.
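As an additional quick check (a minimal sketch, run inside the activated mineru environment), you can confirm that the installed PyTorch build can see the GPU:
.. code:: python

    # Quick sanity check: verify that PyTorch was built with CUDA and can see the GPU.
    import torch

    print(torch.__version__, torch.version.cuda)  # installed version and the CUDA version it was built against
    print(torch.cuda.is_available())              # should print True when CUDA acceleration is usable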
.. _windows_10_or_11_section:
Windows 10/11
--------------
1. Install CUDA and cuDNN
-------------------------
Install a CUDA version that matches torch's requirements; torch currently supports 11.8/12.4/12.6:
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
2. Install Anaconda
-------------------
If conda is already installed, you can skip this step.
Download link: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
3. Create an environment with conda
-----------------------------------
.. code:: bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
4. Install the application
---------------------------
.. code:: bash
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
.. admonition:: Important
:class: tip
After the download finishes, be sure to verify that the magic-pdf version is correct with the following command:
.. code:: bash
magic-pdf --version
If the version is lower than 1.3.0, please report it to us in an issue.
5. Download the models
-----------------------
See :doc:`download_model_weight_files` for details.
6. Locate the configuration file
---------------------------------
After completing step \ `5. Download the models <#5-下载模型>`__\ , the script automatically generates a magic-pdf.json file in the user directory and configures the default model paths. You can find magic-pdf.json in your user directory.
.. admonition:: Tip
:class: tip
On Windows the user directory is "C:/Users/<username>".
7. First run
------------
Download a sample file from the repository and test it:
.. code:: powershell
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
8. Test CUDA acceleration
--------------------------
If your GPU has **8 GB** of VRAM or more, you can follow the steps below to test CUDA-accelerated parsing.
**1. Force-reinstall the CUDA-enabled torch and torchvision** (choose the index-url that matches your CUDA version; see the `torch website <https://pytorch.org/get-started/locally/>`_ for details)
.. code:: bash
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
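To confirm that a CUDA build was actually installed (a small sketch, not part of the official steps), you can run:
.. code:: python

    # Verify that the reinstalled torch is a CUDA build and can see the GPU.
    import torch

    print(torch.cuda.is_available())       # True if the cu124 build was installed correctly
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 3060 Ti"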
**2. Change the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
.. code:: json
{
"device-mode":"cuda"
}
**3. Run the following command to test CUDA acceleration**
.. code:: bash
magic-pdf -p small_ocr.pdf -o ./output
.. admonition:: Tip
:class: tip
You can roughly judge whether CUDA acceleration is working from the per-stage timings printed in the log; CUDA should normally be faster than the CPU.
Download Model Weight Files
============================
Model downloading falls into two cases: an initial download, and an update of an existing model directory. Refer to the corresponding section below for instructions.
Downloading the model files for the first time
-----------------------------------------------
The model files can be downloaded from Hugging Face or ModelScope. Users in mainland China may fail to reach Hugging Face for network reasons; in that case, please use ModelScope.
Option 1: Download the models from Hugging Face
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use the Python script to download the model files from Hugging Face:
.. code:: bash
pip install huggingface_hub
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
The Python script automatically downloads the model files and configures the model directory in the configuration file.
Option 2: Download the models from ModelScope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Use the Python script to download the model files from ModelScope
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: bash
pip install modelscope
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
python download_models.py
The Python script automatically downloads the model files and configures the model directory in the configuration file.
The configuration file can be found in the user directory and is named ``magic-pdf.json``.
.. admonition:: Tip
:class: tip
The user directory is "C:\\Users\\<username>" on Windows, "/home/<username>" on Linux, and "/Users/<username>" on macOS.
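If you want to double-check what the download script wrote, a small sketch like the following reads the generated configuration (the ``models-dir`` and ``device-mode`` key names are taken from the examples in this documentation; adjust them if your version differs):
.. code:: python

    # Sketch: print the model directory and device mode recorded in magic-pdf.json.
    import json
    import os

    config_path = os.path.join(os.path.expanduser("~"), "magic-pdf.json")
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    print("models-dir:", config.get("models-dir"))
    print("device-mode:", config.get("device-mode"))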
How to update models downloaded previously
-------------------------------------------
1. Models previously downloaded via git lfs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. admonition:: Important
:class: tip
Because some users reported incomplete downloads and corrupted model files when downloading via git lfs, this method is no longer recommended.
For 0.9.x and later, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout reading-order model, the models cannot be updated with ``git pull``; use the Python script to update them in one step.
For magic-pdf <= 0.8.1, if you previously downloaded the model files via git lfs, you can go to the original download directory and update the models with ``git pull``.
2. Models previously downloaded from Hugging Face or ModelScope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you previously downloaded the models from Hugging Face or ModelScope, simply rerun the download script you used before; it will automatically update the model directory to the latest version.
Installation
============
If you run into any installation problems, please check :doc:`../../additional_notes/faq` first. If the parsing results are not as expected, see :doc:`../../additional_notes/known_issues`.
.. admonition:: Warning
:class: tip
**Read before installing: supported software and hardware environments**
To ensure the stability and reliability of the project, we only optimize and test against specific software and hardware environments during development. Users who deploy and run the project on the recommended system configuration get the best performance and the fewest compatibility issues.
By focusing our resources on the mainline environment, the team can fix potential bugs more efficiently and develop new features in a timely manner.
In non-mainline environments, because of the diversity of hardware and software configurations and the compatibility issues of third-party dependencies, we cannot guarantee 100% availability of the project. Users who want to run the project in a non-recommended environment should therefore read the documentation and :doc:`../../additional_notes/faq` carefully first; most issues already have solutions in :doc:`../../additional_notes/faq`. Beyond that, we encourage community feedback so that we can gradually widen the supported range.
.. raw:: html
<style>
table, th, td {
border: 1px solid black;
border-collapse: collapse;
}
</style>
<table>
<tr>
<td colspan="3" rowspan="2">操作系统</td>
</tr>
<tr>
<td>Linux after 2019</td>
<td>Windows 10 / 11</td>
<td>macOS 11+</td>
</tr>
<tr>
<td colspan="3">CPU</td>
<td>x86_64 / arm64</td>
<td>x86_64(暂不支持ARM Windows)</td>
<td>x86_64 / arm64</td>
</tr>
<tr>
<td colspan="3">内存</td>
<td colspan="3">大于等于16GB,推荐32G以上</td>
</tr>
<tr>
<td colspan="3">存储空间</td>
<td colspan="3">大于等于20GB,推荐使用SSD以获得最佳性能</td>
</tr>
<tr>
<td colspan="3">python版本</td>
<td colspan="3">>=3.9,<=3.12</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver 版本</td>
<td>latest(专有驱动)</td>
<td>latest</td>
<td>None</td>
</tr>
<tr>
<td colspan="3">CUDA环境</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>None</td>
</tr>
<tr>
<td colspan="3">CANN环境(NPU支持)</td>
<td>8.0+(Ascend 910b)</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td rowspan="2">GPU/MPS 硬件支持列表</td>
<td colspan="2">显存6G以上</td>
<td colspan="2">
Volta(2017)及之后生产的全部带Tensor Core的GPU <br>
6G显存及以上</td>
<td rowspan="2">apple slicon</td>
</tr>
</table>
Create an environment
~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: shell
conda create -n mineru 'python<3.13' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
Download the model weight files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: shell
pip install huggingface_hub
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
MinerU is now installed. See :doc:`../quick_start`, or read :doc:`boost_with_cuda` to speed up inference.
Quick Start
==============
Start here to learn the basics of using MinerU. If you have not installed it yet, please follow the installation guide first.
.. toctree::
:maxdepth: 1
:caption: Quick Start
quick_start/command_line
quick_start/to_markdown
Command Line
============
.. code:: bash
magic-pdf --help
Usage: magic-pdf [OPTIONS]
Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir PATH output local directory [required]
-m, --method [ocr|txt|auto] the method for parsing pdf. ocr: using ocr
technique to extract information from pdf. txt:
suitable for the text-based pdf only and
outperform ocr. auto: automatically choose the
best method for parsing pdf from ocr and txt.
without method specified, auto will be used by
default.
-l, --lang TEXT Input the languages in the pdf (if known) to
improve OCR accuracy. Optional. You should
input "Abbreviation" with language form url: ht
tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
/blog/multi_languages.html#5-support-languages-
and-abbreviations
-d, --debug BOOLEAN Enables detailed debugging information during
the execution of the CLI commands.
-s, --start INTEGER The starting page for PDF parsing, beginning
from 0.
-e, --end INTEGER The ending page for PDF parsing, beginning from
0.
--help Show this message and exit.
## show version
magic-pdf -v
## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
``{some_pdf}`` can be a single PDF file or a directory containing multiple PDF files. The parsed results are written to the ``{some_output_dir}`` directory. The generated files are listed below:
.. code:: text
├── some_pdf.md # markdown file
├── images # directory for extracted images
├── some_pdf_layout.pdf # layout drawing (includes the layout reading order)
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original pdf file
├── some_pdf_spans.pdf # drawing of the finest-grained bbox positions
└── some_pdf_content_list.json # rich-text json in reading order
.. admonition:: Tip
:class: tip
For more information about the output files, see :doc:`../tutorial/output_file_description`.
Convert to Markdown
========================
Local file example
^^^^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod
# args
pdf_file_name = "abc.pdf" # replace with the real pdf path
name_without_suff = pdf_file_name.split(".")[0]
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
image_dir = str(os.path.basename(local_image_dir))
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)
    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)
else:
    infer_result = ds.apply(doc_analyze, ocr=False)
    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)
### draw model result on each page
infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
### draw spans result on each page
pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
### dump markdown
pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
### dump content list
pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
Object storage example
^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
bucket_name = "{Your S3 Bucket Name}" # replace with real bucket name
ak = "{Your S3 access key}" # replace with real s3 access key
sk = "{Your S3 secret key}" # replace with real s3 secret key
endpoint_url = "{Your S3 endpoint_url}" # replace with real s3 endpoint_url
reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url) # replace `unittest/tmp` with the real s3 prefix
writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
# args
pdf_file_name = (
"s3://llm-pdf-text-1/unittest/tmp/bug5-11.pdf" # replace with the real s3 path
)
# prepare env
local_dir = "output"
name_without_suff = os.path.basename(pdf_file_name).split(".")[0]
# read bytes
pdf_bytes = reader.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)
    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)
else:
    infer_result = ds.apply(doc_analyze, ocr=False)
    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)
### draw model result on each page
infer_result.draw_model(os.path.join(local_dir, f'{name_without_suff}_model.pdf')) # dump to local
### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_dir, f'{name_without_suff}_layout.pdf')) # dump to local
### draw spans result on each page
pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf')) # dump to local
### dump markdown
pipe_result.dump_md(writer, f'{name_without_suff}.md', "unittest/tmp/images") # dump to remote s3
### dump content list
pipe_result.dump_content_list(writer, f"{name_without_suff}_content_list.json", "unittest/tmp/images") # dump to remote s3
Go to :doc:`../data/data_reader_writer` for more **read/write** examples.
Tutorial
===========
Let's learn MinerU by building a minimal project.
.. toctree::
:maxdepth: 1
:caption: Tutorial
tutorial/output_file_description
tutorial/pipeline
Output File Formats
====================
Besides the markdown-related output, the ``magic-pdf`` command also generates several files that are not related to markdown. These files are described one by one below.
some_pdf_layout.pdf
~~~~~~~~~~~~~~~~~~~
The layout of every page consists of one or more boxes. The number at the top-left corner of each box is its reading-order index. In addition, layout.pdf marks out the different content blocks with background color patches.
.. figure:: ../../_static/image/layout_example.png
:alt: layout page example
Layout page example
some_pdf_spans.pdf
~~~~~~~~~~~~~~~~~~
All spans on a page are drawn with outline colors that vary by span type. This file can be used for quality checking, making it easy to spot problems such as missing text or unrecognized interline formulas.
.. figure:: ../../_static/image/spans_example.png
:alt: span page example
Span page example
some_pdf_model.json
~~~~~~~~~~~~~~~~~~~
Structure definition
^^^^^^^^^^^^^^^^^^^^
.. code:: python
from pydantic import BaseModel, Field
from enum import IntEnum
class CategoryType(IntEnum):
    title = 0               # title
    plain_text = 1          # text
    abandon = 2             # headers, footers, page numbers and page annotations
    figure = 3              # image
    figure_caption = 4      # image caption
    table = 5               # table
    table_caption = 6       # table caption
    table_footnote = 7      # table footnote
    isolate_formula = 8     # interline formula
    formula_caption = 9     # label of an interline formula
    embedding = 13          # inline formula
    isolated = 14           # interline formula
    text = 15               # OCR recognition result

class PageInfo(BaseModel):
    page_no: int = Field(description="page number, the first page is 0", ge=0)
    height: int = Field(description="page height", gt=0)
    width: int = Field(description="page width", ge=0)

class ObjectInferenceResult(BaseModel):
    category_id: CategoryType = Field(description="category", ge=0)
    poly: list[float] = Field(description="quadrilateral coordinates: top-left, top-right, bottom-right, bottom-left")
    score: float = Field(description="confidence of the inference result")
    latex: str | None = Field(description="latex parsing result", default=None)
    html: str | None = Field(description="html parsing result", default=None)

class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="page recognition results", ge=0)
    page_info: PageInfo = Field(description="page metadata")

# The inference results of all pages, placed in a list in page order, form the MinerU inference result
inference_result: list[PageInferenceResults] = []
The poly coordinates are in the format [x0, y0, x1, y1, x2, y2, x3, y3], giving the coordinates of the top-left, top-right, bottom-right and bottom-left corners in turn. |poly coordinate diagram|
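For example, a small helper (illustrative, not part of magic-pdf) can convert a poly into an axis-aligned bounding box:
.. code:: python

    # Convert an 8-value poly [x0, y0, ..., x3, y3] into [x_min, y_min, x_max, y_max].
    def poly_to_bbox(poly):
        xs, ys = poly[0::2], poly[1::2]
        return [min(xs), min(ys), max(xs), max(ys)]

    print(poly_to_bbox([99.19, 100.31, 730.37, 100.31, 730.37, 245.81, 99.19, 245.81]))
    # -> [99.19, 100.31, 730.37, 245.81]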
Example data
^^^^^^^^^^^^
.. code:: json
[
{
"layout_dets": [
{
"category_id": 2,
"poly": [
99.1906967163086,
100.3119125366211,
730.3707885742188,
100.3119125366211,
730.3707885742188,
245.81326293945312,
99.1906967163086,
245.81326293945312
],
"score": 0.9999997615814209
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 5,
"poly": [
99.13092803955078,
2210.680419921875,
497.3183898925781,
2210.680419921875,
497.3183898925781,
2264.78076171875,
99.13092803955078,
2264.78076171875
],
"score": 0.9999997019767761
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
}
]
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~
====================  ==========================================================================
Field                 Description
====================  ==========================================================================
pdf_info              list; each element is a dict with the parse result of one pdf page, see the next table
\_parse_type          ocr \| txt, the mode used for this parse
\_version_name        string, the version of magic-pdf used for this parse
====================  ==========================================================================
**pdf_info** field structure
====================  ==========================================================================
Field                 Description
====================  ==========================================================================
preproc_blocks        intermediate result after pdf preprocessing, not yet segmented into paragraphs
layout_bboxes         layout segmentation result, containing the layout direction (vertical, horizontal) and bbox, sorted in reading order
page_idx              page index, starting from 0
page_size             width and height of the page
\_layout_tree         layout tree structure
images                list; each element is a dict representing an img_block
tables                list; each element is a dict representing a table_block
interline_equations   list; each element is a dict representing an interline_equation_block
discarded_blocks      list; block information returned by the model that should be dropped
para_blocks           result of segmenting preproc_blocks into paragraphs
====================  ==========================================================================
In the table above, ``para_blocks`` is an array of dicts. Each dict is a block structure, and a block supports at most one level of nesting.
**block**
The outer block is called a first-level block. A first-level block contains the following fields
======  ===============================================
Field   Description
======  ===============================================
type    block type (table|image)
bbox    rectangle coordinates of the block
blocks  list; each element is a dict-format second-level block
======  ===============================================
First-level blocks only have the two types "table" and "image"; all other blocks are second-level blocks.
A second-level block contains the following fields
======  ========================================================================
Field   Description
======  ========================================================================
type    block type
bbox    rectangle coordinates of the block
lines   list; each element is a dict representing a line, describing one line of content
======  ========================================================================
Second-level block types in detail
==================  ======================================
type                desc
==================  ======================================
image_body          body of the image
image_caption       caption text of the image
image_footnote      footnote of the image
table_body          body of the table
table_caption       caption text of the table
table_footnote      footnote of the table
text                text block
title               title block
index               index (table of contents) block
list                list block
interline_equation  interline formula block
==================  ======================================
**line**
A line has the following fields
======  =========================================================================
Field   Description
======  =========================================================================
bbox    rectangle coordinates of the line
spans   list; each element is a dict representing a span, the smallest unit of composition
======  =========================================================================
**span**
===================  ==========================================================
Field                Description
===================  ==========================================================
bbox                 rectangle coordinates of the span
type                 type of the span
content \| img_path  text-type spans use content, image and table spans use img_path, storing the actual text or the path of the screenshot
===================  ==========================================================
A span has one of the following types
==================  =================
type                desc
==================  =================
image               image
table               table
text                text
inline_equation     inline formula
interline_equation  interline formula
==================  =================
**Summary**
A span is the smallest storage unit of all elements.
The elements stored in para_blocks are block information.
The block structure is
first-level block (if any) -> second-level block -> line -> span
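As an illustration of this nesting (a sketch based on the structure described above, not an official tool), the text of every span can be collected by walking block -> line -> span:
.. code:: python

    # Walk some_pdf_middle.json and print the text content of every text span.
    import json

    with open("some_pdf_middle.json", encoding="utf-8") as f:
        middle = json.load(f)

    for page in middle["pdf_info"]:
        for block in page["para_blocks"]:
            # first-level table/image blocks nest second-level blocks under "blocks"
            sub_blocks = block["blocks"] if block["type"] in ("table", "image") else [block]
            for sub in sub_blocks:
                for line in sub.get("lines", []):
                    for span in line["spans"]:
                        if span["type"] == "text":
                            print(span["content"])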
.. _示例数据-1:
Example data
^^^^^^^^^^^^
.. code:: json
{
"pdf_info": [
{
"preproc_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
],
"layout_bboxes": [
{
"layout_bbox": [
52,
61,
294,
731
],
"layout_label": "V",
"sub_layout": []
}
],
"page_idx": 0,
"page_size": [
612.0,
792.0
],
"_layout_tree": [],
"images": [],
"tables": [],
"interline_equations": [],
"discarded_blocks": [],
"para_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
]
}
],
"_parse_type": "txt",
"_version_name": "0.6.1"
}
.. |poly coordinate diagram| image:: ../../_static/image/poly.png
Pipeline
===========
Minimal example
^^^^^^^^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
# args
pdf_file_name = "abc.pdf" # replace with the real pdf path
name_without_suff = pdf_file_name.split(".")[0]
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
image_dir = str(os.path.basename(local_image_dir))
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
Running the code above produces the following result:
.. code:: bash
output/
├── abc.md
└── images
Setting aside the environment setup (creating directories, importing dependencies, and so on), the code that actually converts the ``pdf`` to ``markdown`` is the following snippet:
.. code::
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
``ds.apply(doc_analyze, ocr=True)`` produces an ``InferenceResult`` object. Calling ``pipe_ocr_mode`` on the ``InferenceResult`` object produces a ``PipeResult`` object.
Calling ``dump_md`` on the ``PipeResult`` object writes the ``markdown`` file to the specified location.
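Written out step by step, the chained call above is equivalent to:
.. code:: python

    # The same pipeline without method chaining.
    infer_result = ds.apply(doc_analyze, ocr=True)           # Dataset -> InferenceResult
    pipe_result = infer_result.pipe_ocr_mode(image_writer)   # InferenceResult -> PipeResult
    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)  # write the markdown file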
The pipeline execution process is shown in the figure below.
.. image:: ../../_static/image/pipeline.drawio.svg
.. raw:: html
<br> </br>
The process is currently divided into three stages, data, inference, and post-processing, corresponding to the ``Dataset``, ``InferenceResult``, and ``PipeResult`` entities in the figure. They are chained together through methods such as ``apply``, ``doc_analyze``, and ``pipe_ocr_mode``.
.. admonition:: Tip
:class: tip
For more usage examples of Dataset, InferenceResult, and PipeResult, go to :doc:`../quick_start/to_markdown`.
For more detailed information about Dataset, InferenceResult, and PipeResult, please refer to the English MinerU documentation.
Pipeline composition
^^^^^^^^^^^^^^^^^^^^
.. code:: python
class Dataset(ABC):
    @abstractmethod
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(self, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        pass


class InferenceResult(InferenceResultBase):
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(inference_result, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        return proc(copy.deepcopy(self._infer_res), *args, **kwargs)

    def pipe_ocr_mode(
        self,
        imageWriter: DataWriter,
        start_page_id=0,
        end_page_id=None,
        debug_mode=False,
        lang=None,
    ) -> PipeResult:
        pass


class PipeResult:
    def apply(self, proc: Callable, *args, **kwargs):
        """Apply callable method which.

        Args:
            proc (Callable): invoke proc as follows:
                proc(pipeline_result, *args, **kwargs)

        Returns:
            Any: return the result generated by proc
        """
        return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)
The ``Dataset``, ``InferenceResult``, and ``PipeResult`` classes all have an ``apply`` method, which can be used to compose computations across stages.
As shown below, ``MinerU`` provides one way of composing these classes into a computation.
.. code:: python
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
Users can implement their own composition functions as needed. For example, the ``apply`` method can be used to implement a function that counts the pages of a ``pdf`` file.
.. code:: python
from magic_pdf.data.data_reader_writer import FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
# args
pdf_file_name = "abc.pdf" # replace with the real pdf path
# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name) # read the pdf content
# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)
def count_page(ds) -> int:
    return len(ds)
print("page number: ", ds.apply(count_page)) # will output the page count of `abc.pdf`
@@ -2,8 +2,10 @@
 ## Project List
 
-- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
-- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
-- [web_api](./web_api/README.md): Web API Based on FastAPI
-- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
+- Projects compatible with version 2.0:
+  - [gradio_app](./gradio_app/README.md): Web application based on Gradio
+  - ~~[web_demo](./web_demo/README.md): MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) localized deployment version~~ (Deprecated)
+
+- Projects not yet compatible with version 2.0:
+  - [web_api](./web_api/README.md): Web API based on FastAPI
+  - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
@@ -2,8 +2,9 @@
 ## Project List
 
-- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
-- [gradio_app](./gradio_app/README_zh-CN.md): Web application based on Gradio
-- [web_api](./web_api/README.md): Web API based on FastAPI
-- [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
+- Projects compatible with version 2.0:
+  - [gradio_app](./gradio_app/README_zh-CN.md): Web application based on Gradio
+  - ~~[web_demo](./web_demo/README_zh-CN.md): Localized deployment version of the MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)~~ (deprecated)
+
+- Projects not yet compatible with version 2.0:
+  - [web_api](./web_api/README.md): Web API based on FastAPI
+  - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
@@ -4,30 +4,22 @@ import base64
 import os
 import re
 import time
-import uuid
 import zipfile
 from pathlib import Path
 
 import gradio as gr
-import pymupdf
 from gradio_pdf import PDF
 from loguru import logger
 
-from magic_pdf.data.data_reader_writer import FileBasedDataReader
-from magic_pdf.libs.hash_utils import compute_sha256
-from magic_pdf.tools.common import do_parse, prepare_env
+from mineru.cli.common import prepare_env, do_parse, read_fn
+from mineru.utils.hash_utils import str_sha256
 
 
-def read_fn(path):
-    disk_rw = FileBasedDataReader(os.path.dirname(path))
-    return disk_rw.read(os.path.basename(path))
-
-
-def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, layout_mode, formula_enable, table_enable, language):
+def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, formula_enable, table_enable, language):
     os.makedirs(output_dir, exist_ok=True)
 
     try:
-        file_name = f'{str(Path(doc_path).stem)}_{time.time()}'
+        file_name = f'{str(Path(doc_path).stem)}_{time.strftime("%y%m%d_%H%M%S")}'
         pdf_data = read_fn(doc_path)
         if is_ocr:
             parse_method = 'ocr'
@@ -35,17 +27,14 @@ def parse_pdf(doc_path, output_dir, end_page_id, is_ocr, layout_mode, formula_en
             parse_method = 'auto'
         local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
         do_parse(
-            output_dir,
-            file_name,
-            pdf_data,
-            [],
-            parse_method,
-            False,
+            output_dir=output_dir,
+            pdf_file_names=[file_name],
+            pdf_bytes_list=[pdf_data],
+            p_lang_list=[language],
+            parse_method=parse_method,
             end_page_id=end_page_id,
-            layout_model=layout_mode,
-            formula_enable=formula_enable,
-            table_enable=table_enable,
-            lang=language,
+            p_formula_enable=formula_enable,
+            p_table_enable=table_enable,
        )
         return local_md_dir, file_name
     except Exception as e:
@@ -96,12 +85,11 @@ def replace_image_with_base64(markdown_text, image_dir_path):
     return re.sub(pattern, replace, markdown_text)
 
 
-def to_markdown(file_path, end_pages, is_ocr, layout_mode, formula_enable, table_enable, language):
+def to_markdown(file_path, end_pages, is_ocr, formula_enable, table_enable, language):
     file_path = to_pdf(file_path)
     # get the recognized md file and the path of the zip archive
-    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr,
-                                        layout_mode, formula_enable, table_enable, language)
-    archive_zip_path = os.path.join('./output', compute_sha256(local_md_dir) + '.zip')
+    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr, formula_enable, table_enable, language)
+    archive_zip_path = os.path.join('./output', str_sha256(local_md_dir) + '.zip')
     zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
     if zip_archive_success == 0:
         logger.info('压缩成功')
@@ -125,24 +113,6 @@ latex_delimiters = [
 ]
 
 
-def init_model():
-    from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
-    try:
-        model_manager = ModelSingleton()
-        txt_model = model_manager.get_model(False, False)  # noqa: F841
-        logger.info('txt_model init final')
-        ocr_model = model_manager.get_model(True, False)  # noqa: F841
-        logger.info('ocr_model init final')
-        return 0
-    except Exception as e:
-        logger.exception(e)
-        return -1
-
-
-model_init = init_model()
-logger.info(f'model_init: {model_init}')
-
-
 with open('header.html', 'r') as file:
     header = file.read()
 
@@ -171,24 +141,30 @@ all_lang = []
 all_lang.extend([*other_lang, *add_lang])
 
 
+def safe_stem(file_path):
+    stem = Path(file_path).stem
+    # keep only letters, digits, underscores and dots; replace everything else with underscores
+    return re.sub(r'[^\w.]', '_', stem)
+
+
 def to_pdf(file_path):
-    with pymupdf.open(file_path) as f:
-        if f.is_pdf:
-            return file_path
-        else:
-            pdf_bytes = f.convert_to_pdf()
-            # generate a unique filename and write the pdf bytes to it
-            unique_filename = f'{uuid.uuid4()}.pdf'
-            # build the full file path
-            tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
-            # write the byte data to the file
-            with open(tmp_file_path, 'wb') as tmp_pdf_file:
-                tmp_pdf_file.write(pdf_bytes)
-            return tmp_file_path
+    if file_path is None:
+        return None
+    pdf_bytes = read_fn(file_path)
+    # unique_filename = f'{uuid.uuid4()}.pdf'
+    unique_filename = f'{safe_stem(file_path)}.pdf'
+    # build the full file path
+    tmp_file_path = os.path.join(os.path.dirname(file_path), unique_filename)
+    # write the byte data to the file
+    with open(tmp_file_path, 'wb') as tmp_pdf_file:
+        tmp_pdf_file.write(pdf_bytes)
+    return tmp_file_path
 
 
 if __name__ == '__main__':
@@ -196,14 +172,16 @@ if __name__ == '__main__':
         gr.HTML(header)
         with gr.Row():
             with gr.Column(variant='panel', scale=5):
-                file = gr.File(label='Please upload a PDF or image', file_types=['.pdf', '.png', '.jpeg', '.jpg'])
-                max_pages = gr.Slider(1, 20, 10, step=1, label='Max convert pages')
                 with gr.Row():
-                    layout_mode = gr.Dropdown(['doclayout_yolo'], label='Layout model', value='doclayout_yolo')
-                    language = gr.Dropdown(all_lang, label='Language', value='ch')
+                    file = gr.File(label='Please upload a PDF or image', file_types=['.pdf', '.png', '.jpeg', '.jpg'])
+                with gr.Row(equal_height=True):
+                    with gr.Column(scale=4):
+                        max_pages = gr.Slider(1, 20, 10, step=1, label='Max convert pages')
+                    with gr.Column(scale=1):
+                        language = gr.Dropdown(all_lang, label='Language', value='ch')
                 with gr.Row():
+                    formula_enable = gr.Checkbox(label='Enable formula recognition', value=True)
                     is_ocr = gr.Checkbox(label='Force enable OCR', value=False)
-                    formula_enable = gr.Checkbox(label='Enable formula recognition', value=True)
                     table_enable = gr.Checkbox(label='Enable table recognition(test)', value=True)
                 with gr.Row():
                     change_bu = gr.Button('Convert')
@@ -227,7 +205,7 @@ if __name__ == '__main__':
            with gr.Tab('Markdown text'):
                 md_text = gr.TextArea(lines=45, show_copy_button=True)
     file.change(fn=to_pdf, inputs=file, outputs=pdf_show)
-    change_bu.click(fn=to_markdown, inputs=[file, max_pages, is_ocr, layout_mode, formula_enable, table_enable, language],
+    change_bu.click(fn=to_markdown, inputs=[file, max_pages, is_ocr, formula_enable, table_enable, language],
                     outputs=[md, md_text, output_file, pdf_show])
     clear_bu.add([file, md, pdf_show, md_text, output_file, is_ocr])
## Installation
MinerU
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
```
Third-party software
```bash
# install
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
pip install einops==0.7.0
pip install transformers-stream-generator==0.0.5
pip install accelerate==0.33.0
# uninstall
pip uninstall transformer-engine
```
## Environment Configuration
```
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
For instructions on obtaining a DASHSCOPE_API_KEY, refer to [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
## Usage
### Data Ingestion
```bash
python data_ingestion.py -p some.pdf # load data from pdf
or
python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiple pdfs under the directory {some_pdf_directory}
```
### Query
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## Example
````bash
# Start the es service
docker compose up -d
or
docker-compose up -d
# Set environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
# Ingest data
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# Ask a question
python query.py -q 'how about the rights of men'
## outputs
Please answer the question based on the content within ```:
```
I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
```
My question is: how about the rights of men.
question: how about the rights of men
answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
````
## Development
`MinerU` provides a `RAG` integration interface, allowing users to specify a single input `pdf` file or a directory. `MinerU` will automatically parse the input files and return an iterable interface for retrieving the data.
### API Interface
```python
from magic_pdf.integrations.rag.type import Node
class RagPageReader:
    def get_rel_map(self) -> list[ElementRelation]:
        # Retrieve the relationships between nodes
        pass

    ...

class RagDocumentReader:
    ...

class DataReader:
    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        pass

    def get_documents_count(self) -> int:
        """Get the number of pdf documents."""
        pass

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """Retrieve the parsed content of a specific pdf."""
        pass

    def get_document_filename(self, idx: int) -> Path:
        """Retrieve the path of a specific pdf."""
        pass
```
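A minimal usage sketch of this interface (it mirrors the `data_ingestion.py` example shown later in this document):

```python
# Parse one file and iterate the resulting nodes page by page.
from magic_pdf.integrations.rag.api import DataReader

documents = DataReader("some.pdf", "ocr", "/tmp/magic_pdf/integrations/rag/")
print("documents:", documents.get_documents_count())

doc = documents.get_document_result(0)   # RagDocumentReader, or None if parsing failed
if doc is not None:
    for page in doc:                     # iterate the pages of the document
        for element in page:             # iterate the nodes on the page
            if element.text:
                print(element.category_type, element.text[:80])
```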
Type Definitions
```python
class Node(BaseModel):
    category_type: CategoryType = Field(description='Category')  # Category
    text: str | None = Field(description='Text content', default=None)
    image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None)
    anno_id: int = Field(description='Unique ID', default=-1)
    latex: str | None = Field(description='LaTeX output for equations or tables', default=None)
    html: str | None = Field(description='HTML output for tables', default=None)
```
Tables can be stored in one of three formats: image, LaTeX, or HTML.
`anno_id` is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the `get_rel_map` method. Users can use `anno_id` to link nodes and construct a RAG index that includes node relationships.
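For instance, a sketch along the following lines could pair caption nodes with their image or table bodies. The field names on the relation objects (`source_anno_id`, `target_anno_id`) are assumptions made for illustration; check `magic_pdf.integrations.rag.type` for the exact schema.

```python
# Hypothetical sketch: link caption nodes to body nodes via anno_id.
nodes_by_anno_id = {}
for page in doc:                          # doc: a RagDocumentReader obtained from DataReader
    for node in page:
        if node.anno_id != -1:
            nodes_by_anno_id[node.anno_id] = node
    for rel in page.get_rel_map():        # relationships among nodes on this page
        src = nodes_by_anno_id.get(rel.source_anno_id)    # assumed field name
        tgt = nodes_by_anno_id.get(rel.target_anno_id)    # assumed field name
        if src and tgt:
            print(src.category_type, '<->', tgt.category_type)
```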
### Node Relationship Matrix
| | image_body | table_body |
| -------------- | ---------- | ---------- |
| image_caption | sibling | |
| table_caption | | sibling |
| table_footnote | | sibling |
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#installation">Installation</a></li>
<li><a href="#example">Example</a></li>
<li><a href="#development">Development</a></li>
</ol>
</details>
## Introduction
`MinerU` provides a data `API` so that users can import data into a `RAG` system. This project demonstrates how to build a lightweight `RAG` system based on `Qwen (Tongyi Qianwen)`.
<p align="center">
<img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
</p>
## Installation
Environment requirements
```text
NVIDIA A100 80GB,
Centos 7 3.10.0-957.el7.x86_64
Client: Docker Engine - Community
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:39:02 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:38:05 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.25
GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
runc:
Version: 1.1.10
GitCommit: v1.1.10-0-g18a0cb0
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
Please refer to the [documentation](../../README_zh-CN.md) to install MinerU.
Third-party software
```bash
# install
pip install modelscope==1.14.0
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
pip install einops==0.7.0
pip install transformers-stream-generator==0.0.5
pip install accelerate==0.33.0
# uninstall
pip uninstall transformer-engine
```
## Example
````bash
cd projects/llama_index_rag
docker compose up -d
or
docker-compose up -d
# Set environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
# For instructions on obtaining a DASHSCOPE_API_KEY, refer to https://help.aliyun.com/zh/dashscope/opening-service
# Query before ingesting any data; Qwen returns its default answer
python query.py -q 'how about the rights of men'
## outputs
question: how about the rights of men
answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
# Ingest data
python data_ingestion.py -p example/data/
or
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# After ingesting data, query again; Qwen answers based on the RAG retrieval results and the surrounding context.
python query.py -q 'how about the rights of men'
## outputs
Please answer the question based on the content within ```:
```
I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
```
My question is: how about the rights of men.
question: how about the rights of men
answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
````
## Development
`MinerU` provides a `RAG` integration interface. Users can point it at a single `pdf` file or a directory; `MinerU` automatically parses the input files and returns an iterable interface for retrieving the data.
### API Interface
```python
from magic_pdf.integrations.rag.type import Node
class RagPageReader:
    def get_rel_map(self) -> list[ElementRelation]:
        # Retrieve the relationships between nodes
        pass

    ...

class RagDocumentReader:
    ...

class DataReader:
    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        pass

    def get_documents_count(self) -> int:
        """Get the number of pdf documents."""
        pass

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """Retrieve the parsed content of a specific pdf."""
        pass

    def get_document_filename(self, idx: int) -> Path:
        """Retrieve the path of a specific pdf."""
        pass
```
Type definitions
```python
class Node(BaseModel):
    category_type: CategoryType = Field(description='category')  # category
    text: str | None = Field(description='text content', default=None)
    image_path: str | None = Field(description='storage path of the image or table (tables may be stored as images)', default=None)
    anno_id: int = Field(description='unique id', default=-1)
    latex: str | None = Field(description='latex parsing result of the formula or table', default=None)
    html: str | None = Field(description='html parsing result of the table', default=None)
```
A table may be stored in one of three forms: image, latex, or html.
anno_id is the globally unique ID of the Node. It can later be used to match this Node against other Nodes. Node relationships can be retrieved with the `get_rel_map` method. Users can use `anno_id` to match relationships between nodes and build a RAG index that includes those relationships.
### Node Relationship Matrix
| | image_body | table_body |
| -------------- | ---------- | ---------- |
| image_caption | sibling | |
| table_caption | | sibling |
| table_footnote | | sibling |
import os

import click
from llama_index.core.schema import TextNode
from llama_index.embeddings.dashscope import (DashScopeEmbedding,
                                              DashScopeTextEmbeddingModels,
                                              DashScopeTextEmbeddingType)
from llama_index.vector_stores.elasticsearch import ElasticsearchStore

from magic_pdf.integrations.rag.api import DataReader

es_vec_store = ElasticsearchStore(
    index_name='rag_index',
    es_url=os.getenv('ES_URL', 'http://127.0.0.1:9200'),
    es_user=os.getenv('ES_USER', 'elastic'),
    es_password=os.getenv('ES_PASSWORD', 'llama_index'),
)


# Create embeddings
# text_type=`document` to build index
def embed_node(node):
    embedder = DashScopeEmbedding(
        model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2,
        text_type=DashScopeTextEmbeddingType.TEXT_TYPE_DOCUMENT,
    )
    result_embeddings = embedder.get_text_embedding(node.text)
    node.embedding = result_embeddings
    return node


@click.command()
@click.option(
    '-p',
    '--path',
    'path',
    type=click.Path(exists=True),
    required=True,
    help='local pdf filepath or directory',
)
def cli(path):
    output_dir = '/tmp/magic_pdf/integrations/rag/'
    os.makedirs(output_dir, exist_ok=True)
    documents = DataReader(path, 'ocr', output_dir)

    # build nodes
    nodes = []
    for idx in range(documents.get_documents_count()):
        doc = documents.get_document_result(idx)
        if doc is None:  # something wrong happens when parse pdf !
            continue
        for page in iter(doc):  # iterate documents from initial page to last page !
            for element in iter(page):  # iterate the element from all page !
                if element.text is None:
                    continue
                nodes.append(
                    embed_node(
                        TextNode(text=element.text,
                                 metadata={'purpose': 'demo'})))
    es_vec_store.add(nodes)


if __name__ == '__main__':
    cli()
services:
  es:
    container_name: es
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.3
    volumes:
      - esdata01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    environment:
      - node.name=es
      - ELASTIC_PASSWORD=llama_index
      - bootstrap.memory_lock=false
      - discovery.type=single-node
      - xpack.security.enabled=true
      - xpack.security.http.ssl.enabled=false
      - xpack.security.transport.ssl.enabled=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always

volumes:
  esdata01:
    driver: local
import os

import click
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.embeddings.dashscope import (DashScopeEmbedding,
                                              DashScopeTextEmbeddingModels,
                                              DashScopeTextEmbeddingType)
from llama_index.vector_stores.elasticsearch import (AsyncDenseVectorStrategy,
                                                     ElasticsearchStore)

# initialize qwen 7B model
from modelscope import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

es_vector_store = ElasticsearchStore(
    index_name='rag_index',
    es_url=os.getenv('ES_URL', 'http://127.0.0.1:9200'),
    es_user=os.getenv('ES_USER', 'elastic'),
    es_password=os.getenv('ES_PASSWORD', 'llama_index'),
    retrieval_strategy=AsyncDenseVectorStrategy(),
)


def embed_text(text):
    embedder = DashScopeEmbedding(
        model_name=DashScopeTextEmbeddingModels.TEXT_EMBEDDING_V2,
        text_type=DashScopeTextEmbeddingType.TEXT_TYPE_DOCUMENT,
    )
    return embedder.get_text_embedding(text)


def search(vector_store: ElasticsearchStore, query: str):
    query_vec = VectorStoreQuery(query_embedding=embed_text(query))
    result = vector_store.query(query_vec)
    return '\n'.join([node.text for node in result.nodes])


@click.command()
@click.option(
    '-q',
    '--question',
    'question',
    required=True,
    help='ask what you want to know!',
)
def cli(question):
    tokenizer = AutoTokenizer.from_pretrained('qwen/Qwen-7B-Chat',
                                              revision='v1.0.5',
                                              trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained('qwen/Qwen-7B-Chat',
                                                 revision='v1.0.5',
                                                 device_map='auto',
                                                 trust_remote_code=True,
                                                 fp32=True).eval()
    model.generation_config = GenerationConfig.from_pretrained(
        'Qwen/Qwen-7B-Chat', revision='v1.0.5', trust_remote_code=True)

    # define a prompt template for the vectorDB-enhanced LLM generation
    def answer_question(question, context, model):
        if context == '':
            prompt = question
        else:
            prompt = f'''请基于```内的内容回答问题。"
```
{context}
```
我的问题是:{question}
'''
        history = None
        print(prompt)
        response, history = model.chat(tokenizer, prompt, history=None)
        return response

    answer = answer_question(question, search(es_vector_store, question),
                             model)
    print(f'question: {question}\n'
          f'answer: {answer}')


"""
python query.py -q 'how about the rights of men'
"""

if __name__ == '__main__':
    cli()