Commit 0fc1daac authored by luopl's avatar luopl
Browse files

Initial commit

parents
# 常见问题解答
### 1.在较新版本的mac上使用命令安装pip install magic-pdf\[full\] zsh: no matches found: magic-pdf\[full\]
在 macOS 上,默认的 shell 从 Bash 切换到了 Z shell,而 Z shell 对于某些类型的字符串匹配有特殊的处理逻辑,这可能导致no matches found错误。
可以通过在命令行禁用globbing特性,再尝试运行安装命令
```bash
setopt no_nomatch
pip install magic-pdf[full]
```
### 2.使用过程中遇到_pickle.UnpicklingError: invalid load key, 'v'.错误
可能是由于模型文件未下载完整导致,可尝试重新下载模型文件后再试
参考:https://github.com/opendatalab/MinerU/issues/143
### 3.模型文件应该下载到哪里/models-dir的配置应该怎么填
模型文件的路径输入是在"magic-pdf.json"中通过
```json
{
"models-dir": "/tmp/models"
}
```
进行配置的。
这个路径是绝对路径而不是相对路径,绝对路径的获取可在models目录中通过命令 "pwd" 获取。
参考:https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
### 4.在WSL2的Ubuntu22.04中遇到报错`ImportError: libGL.so.1: cannot open shared object file: No such file or directory`
WSL2的Ubuntu22.04中缺少`libgl`库,可通过以下命令安装`libgl`库解决:
```bash
sudo apt-get install libgl1-mesa-glx
```
参考:https://github.com/opendatalab/MinerU/issues/388
### 5.遇到报错 `ModuleNotFoundError : Nomodulenamed 'fairscale'`
需要卸载该模块并重新安装
```bash
pip uninstall fairscale
pip install fairscale
```
参考:https://github.com/opendatalab/MinerU/issues/411
### 6.在部分较新的设备如H100上,使用CUDA加速OCR时解析出的文字乱码。
cuda11对新显卡的兼容性不好,需要升级paddle使用的cuda版本
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
参考:https://github.com/opendatalab/MinerU/issues/558
### 7.在部分Linux服务器上,程序一运行就报错 `非法指令 (核心已转储)` 或 `Illegal instruction (core dumped)`
可能是因为服务器CPU不支持AVX/AVX2指令集,或cpu本身支持但被运维禁用了,可以尝试联系运维解除限制或更换服务器。
参考:https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736
### 8.在 CentOS 7 或 Ubuntu 18 系统安装MinerU时报错`ERROR: Failed building wheel for simsimd`
新版本albumentations(1.4.21)引入了依赖simsimd,由于simsimd在linux的预编译包要求glibc的版本大于等于2.28,导致部分2019年之前发布的Linux发行版无法正常安装,可通过如下命令安装:
```
pip install -U magic-pdf[full,old_linux] --extra-index-url https://wheels.myhloli.com
```
参考:https://github.com/opendatalab/MinerU/issues/1004
### 9. 旧显卡如M40出现 "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED"
在运行过程中(使用CUDA)出现以下错误:
```
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
由于Turing架构之前的显卡不支持BF16精度,并且部分显卡未能被PyTorch正确识别,因此需要手动关闭BF16精度。
请找到并修改`pdf_parse_union_core_v2.py`文件中的第287至290行代码(注意:不同版本中位置可能有所不同),原代码如下:
```python
if torch.cuda.is_bf16_supported():
supports_bfloat16 = True
else:
supports_bfloat16 = False
```
将其修改为:
```python
supports_bfloat16 = False
```
参考:https://github.com/opendatalab/MinerU/issues/1508
## Overview
After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf
Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors.
![layout example](images/layout_example.png)
### some_pdf_spans.pdf
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
![spans example](images/spans_example.png)
### some_pdf_model.json
#### Structure Definition
```python
from pydantic import BaseModel, Field
from enum import IntEnum
class CategoryType(IntEnum):
title = 0 # Title
plain_text = 1 # Text
abandon = 2 # Includes headers, footers, page numbers, and page annotations
figure = 3 # Image
figure_caption = 4 # Image description
table = 5 # Table
table_caption = 6 # Table description
table_footnote = 7 # Table footnote
isolate_formula = 8 # Block formula
formula_caption = 9 # Formula label
embedding = 13 # Inline formula
isolated = 14 # Block formula
text = 15 # OCR recognition result
class PageInfo(BaseModel):
page_no: int = Field(description="Page number, the first page is 0", ge=0)
height: int = Field(description="Page height", gt=0)
width: int = Field(description="Page width", ge=0)
class ObjectInferenceResult(BaseModel):
category_id: CategoryType = Field(description="Category", ge=0)
poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
score: float = Field(description="Confidence of the inference result")
latex: str | None = Field(description="LaTeX parsing result", default=None)
html: str | None = Field(description="HTML parsing result", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
page_info: PageInfo = Field(description="Page metadata")
# The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
![Poly Coordinate Diagram](images/poly.png)
#### example
```json
[
{
"layout_dets": [
{
"category_id": 2,
"poly": [
99.1906967163086,
100.3119125366211,
730.3707885742188,
100.3119125366211,
730.3707885742188,
245.81326293945312,
99.1906967163086,
245.81326293945312
],
"score": 0.9999997615814209
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 5,
"poly": [
99.13092803955078,
2210.680419921875,
497.3183898925781,
2210.680419921875,
497.3183898925781,
2264.78076171875,
99.13092803955078,
2264.78076171875
],
"score": 0.9999997019767761
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
}
]
```
### some_pdf_middle.json
| Field Name | Description |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
| pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
| \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
| \_version_name | string, indicates the version of magic-pdf used in this parsing |
<br>
**pdf_info**
Field structure description
| Field Name | Description |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 |
| page_size | Page width and height |
| \_layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block |
| discarded_blocks | List, block information returned by the model that needs to be dropped |
| para_blocks | Result after segmenting preproc_blocks |
In the above table, `para_blocks` is an array of dicts, each dict representing a block structure. A block can support up to one level of nesting.
<br>
**block**
The outer block is referred to as a first-level block, and the fields in the first-level block include:
| Field Name | Description |
| :--------- | :------------------------------------------------------------- |
| type | Block type (table\|image) |
| bbox | Block bounding box coordinates |
| blocks | list, each element is a dict representing a second-level block |
<br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.
The fields in a second-level block include:
| Field Name | Description |
| :--------- | :---------------------------------------------------------------------------------------------------------- |
| type | Block type |
| bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information |
Detailed explanation of second-level block types
| type | Description |
| :----------------- | :--------------------- |
| image_body | Main body of the image |
| image_caption | Image description text |
| image_footnote | Image footnote |
| table_body | Main body of the table |
| table_caption | Table description text |
| table_footnote | Table footnote |
| text | Text block |
| title | Title block |
| index | Index block |
| list | List block |
| interline_equation | Block formula |
<br>
**line**
The field format of a line is as follows:
| Field Name | Description |
| :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
<br>
**span**
| Field Name | Description |
| :------------------ | :------------------------------------------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the span |
| type | Type of the span |
| content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information |
The types of spans are as follows:
| type | Description |
| :----------------- | :------------- |
| image | Image |
| table | Table |
| text | Text |
| inline_equation | Inline formula |
| interline_equation | Block formula |
**Summary**
A span is the smallest storage unit for all elements.
The elements stored within para_blocks are block information.
The block structure is as follows:
First-level block (if any) -> Second-level block -> Line -> Span
#### example
```json
{
"pdf_info": [
{
"preproc_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
],
"layout_bboxes": [
{
"layout_bbox": [
52,
61,
294,
731
],
"layout_label": "V",
"sub_layout": []
}
],
"page_idx": 0,
"page_size": [
612.0,
792.0
],
"_layout_tree": [],
"images": [],
"tables": [],
"interline_equations": [],
"discarded_blocks": [],
"para_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
]
}
],
"_parse_type": "txt",
"_version_name": "0.6.1"
}
```
## 概览
`magic-pdf` 命令执行后除了输出和 markdown 有关的文件以外,还会生成若干个和 markdown 无关的文件。现在将一一介绍这些文件
### some_pdf_layout.pdf
每一页的 layout 均由一个或多个框组成。 每个框左上脚的数字表明它们的序号。此外 layout.pdf 框内用不同的背景色块圈定不同的内容块。
![layout 页面示例](images/layout_example.png)
### some_pdf_spans.pdf
根据 span 类型的不同,采用不同颜色线框绘制页面上所有 span。该文件可以用于质检,可以快速排查出文本丢失、行间公式未识别等问题。
![span 页面示例](images/spans_example.png)
### some_pdf_model.json
#### 结构定义
```python
from pydantic import BaseModel, Field
from enum import IntEnum
class CategoryType(IntEnum):
title = 0 # 标题
plain_text = 1 # 文本
abandon = 2 # 包括页眉页脚页码和页面注释
figure = 3 # 图片
figure_caption = 4 # 图片描述
table = 5 # 表格
table_caption = 6 # 表格描述
table_footnote = 7 # 表格注释
isolate_formula = 8 # 行间公式
formula_caption = 9 # 行间公式的标号
embedding = 13 # 行内公式
isolated = 14 # 行间公式
text = 15 # ocr 识别结果
class PageInfo(BaseModel):
page_no: int = Field(description="页码序号,第一页的序号是 0", ge=0)
height: int = Field(description="页面高度", gt=0)
width: int = Field(description="页面宽度", ge=0)
class ObjectInferenceResult(BaseModel):
category_id: CategoryType = Field(description="类别", ge=0)
poly: list[float] = Field(description="四边形坐标, 分别是 左上,右上,右下,左下 四点的坐标")
score: float = Field(description="推理结果的置信度")
latex: str | None = Field(description="latex 解析结果", default=None)
html: str | None = Field(description="html 解析结果", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="页面识别结果", ge=0)
page_info: PageInfo = Field(description="页面元信息")
# 所有页面的推理结果按照页码顺序依次放到列表中即为 minerU 推理结果
inference_result: list[PageInferenceResults] = []
```
poly 坐标的格式 \[x0, y0, x1, y1, x2, y2, x3, y3\], 分别表示左上、右上、右下、左下四点的坐标
![poly 坐标示意图](images/poly.png)
#### 示例数据
```json
[
{
"layout_dets": [
{
"category_id": 2,
"poly": [
99.1906967163086,
100.3119125366211,
730.3707885742188,
100.3119125366211,
730.3707885742188,
245.81326293945312,
99.1906967163086,
245.81326293945312
],
"score": 0.9999997615814209
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 5,
"poly": [
99.13092803955078,
2210.680419921875,
497.3183898925781,
2210.680419921875,
497.3183898925781,
2264.78076171875,
99.13092803955078,
2264.78076171875
],
"score": 0.9999997019767761
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
}
]
```
### some_pdf_middle.json
| 字段名 | 解释 |
| :------------- | :----------------------------------------------------------------- |
| pdf_info | list,每个元素都是一个dict,这个dict是每一页pdf的解析结果,详见下表 |
| \_parse_type | ocr \| txt,用来标识本次解析的中间态使用的模式 |
| \_version_name | string, 表示本次解析使用的 magic-pdf 的版本号 |
<br>
**pdf_info**
字段结构说明
| 字段名 | 解释 |
| :------------------ | :------------------------------------------------------------------- |
| preproc_blocks | pdf预处理后,未分段的中间结果 |
| layout_bboxes | 布局分割的结果,含有布局的方向(垂直、水平),和bbox,按阅读顺序排序 |
| page_idx | 页码,从0开始 |
| page_size | 页面的宽度和高度 |
| \_layout_tree | 布局树状结构 |
| images | list,每个元素是一个dict,每个dict表示一个img_block |
| tables | list,每个元素是一个dict,每个dict表示一个table_block |
| interline_equations | list,每个元素是一个dict,每个dict表示一个interline_equation_block |
| discarded_blocks | List, 模型返回的需要drop的block信息 |
| para_blocks | 将preproc_blocks进行分段之后的结果 |
上表中 `para_blocks` 是个dict的数组,每个dict是一个block结构,block最多支持一次嵌套
<br>
**block**
外层block被称为一级block,一级block中的字段包括
| 字段名 | 解释 |
| :----- | :---------------------------------------------- |
| type | block类型(table\|image) |
| bbox | block矩形框坐标 |
| blocks | list,里面的每个元素都是一个dict格式的二级block |
<br>
一级block只有"table"和"image"两种类型,其余block均为二级block
二级block中的字段包括
| 字段名 | 解释 |
| :----- | :----------------------------------------------------------- |
| type | block类型 |
| bbox | block矩形框坐标 |
| lines | list,每个元素都是一个dict表示的line,用来描述一行信息的构成 |
二级block的类型详解
| type | desc |
| :----------------- | :------------- |
| image_body | 图像的本体 |
| image_caption | 图像的描述文本 |
| image_footnote | 图像的脚注 |
| table_body | 表格本体 |
| table_caption | 表格的描述文本 |
| table_footnote | 表格的脚注 |
| text | 文本块 |
| title | 标题块 |
| index | 目录块 |
| list | 列表块 |
| interline_equation | 行间公式块 |
<br>
**line**
line 的 字段格式如下
| 字段名 | 解释 |
| :----- | :------------------------------------------------------------------- |
| bbox | line的矩形框坐标 |
| spans | list,每个元素都是一个dict表示的span,用来描述一个最小组成单元的构成 |
<br>
**span**
| 字段名 | 解释 |
| :------------------ | :------------------------------------------------------------------------------- |
| bbox | span的矩形框坐标 |
| type | span的类型 |
| content \| img_path | 文本类型的span使用content,图表类使用img_path 用来存储实际的文本或者截图路径信息 |
span 的类型有如下几种
| type | desc |
| :----------------- | :------- |
| image | 图片 |
| table | 表格 |
| text | 文本 |
| inline_equation | 行内公式 |
| interline_equation | 行间公式 |
**总结**
span是所有元素的最小存储单元
para_blocks内存储的元素为区块信息
区块结构为
一级block(如有)->二级block->line->span
#### 示例数据
```json
{
"pdf_info": [
{
"preproc_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
],
"layout_bboxes": [
{
"layout_bbox": [
52,
61,
294,
731
],
"layout_label": "V",
"sub_layout": []
}
],
"page_idx": 0,
"page_size": [
612.0,
792.0
],
"_layout_tree": [],
"images": [],
"tables": [],
"interline_equations": [],
"discarded_blocks": [],
"para_blocks": [
{
"type": "text",
"bbox": [
52,
61.956024169921875,
294,
82.99800872802734
],
"lines": [
{
"bbox": [
52,
61.956024169921875,
294,
72.0000228881836
],
"spans": [
{
"bbox": [
54.0,
61.956024169921875,
296.2261657714844,
72.0000228881836
],
"content": "dependent on the service headway and the reliability of the departure ",
"type": "text",
"score": 1.0
}
]
}
]
}
]
}
],
"_parse_type": "txt",
"_version_name": "0.6.1"
}
```
icon.png

61 KB

{
"bucket_info":{
"bucket-name-1":["ak", "sk", "endpoint"],
"bucket-name-2":["ak", "sk", "endpoint"]
},
"latex-delimiter-config": {
"display": {
"left": "$$",
"right": "$$"
},
"inline": {
"left": "$",
"right": "$"
}
},
"llm-aided-config": {
"title_aided": {
"api_key": "your_api_key",
"base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
"model": "qwen2.5-32b-instruct",
"enable": false
}
},
"models-dir": {
"pipeline": "",
"vlm": ""
},
"config_version": "1.3.0"
}
\ No newline at end of file
# Copyright (c) Opendatalab. All rights reserved.
# Copyright (c) Opendatalab. All rights reserved.
# Copyright (c) Opendatalab. All rights reserved.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment