Unverified commit 41d96cd8, authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #2065 from opendatalab/release-1.3.0

Release 1.3.0
parents c3d43e52 dd96663c
...@@ -47,6 +47,20 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>

# Changelog
- 2025/04/03 Release of version 1.3.0, with many changes in this version:
  - Installation and compatibility optimization
    - Replaced the Paddle framework and PaddleOCR throughout the project with paddleocr2torch, resolving conflicts between Paddle and PyTorch.
    - Removed layoutlmv3 from layout detection, resolving compatibility issues caused by `detectron2`.
    - Extended torch version compatibility to 2.2~2.6.
    - Extended CUDA compatibility to 11.8~12.6 (the CUDA version is determined by torch), addressing compatibility issues for some users with 50-series and H-series Nvidia GPUs.
    - Extended Python compatibility to 3.10~3.12, resolving the automatic downgrade to 0.6.1 when installing in non-3.10 environments.
  - Performance optimization (compared to version 1.0.1, formula parsing speed improved by over 1400%, and overall parsing speed improved by over 500%)
    - Improved parsing speed for batches of small PDF files ([script example](demo/batch_demo.py)).
    - Optimized the loading and usage of the mfr model, reducing VRAM usage and improving parsing speed (requires re-running the [model download process](docs/how_to_download_models_en.md) to obtain the incremental model file updates).
    - Optimized memory usage, allowing the project to run with as little as 6GB of VRAM.
    - Improved running speed on MPS devices.
  - Parsing quality optimization
    - Updated the mfr model to unimernet(2503), fixing missing line breaks in multi-line formulas.
- 2025/03/03 1.2.1 released, fixed several bugs:
  - Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers
  - Fixed caption matching inaccuracies in certain scenarios
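The compatibility ranges listed above can be sanity-checked before installing. A minimal sketch (the bounds mirror the changelog; the helper name is ours, not part of magic-pdf):

```python
import sys

# Supported interpreter range from the 1.3.0 changelog (inclusive bounds assumed).
PYTHON_MIN = (3, 10)
PYTHON_MAX = (3, 12)

def python_supported(version=None):
    """Return True when the (major, minor) version falls inside the supported range."""
    major, minor = version if version is not None else sys.version_info[:2]
    return PYTHON_MIN <= (major, minor) <= PYTHON_MAX

# Example: 3.9 environments previously triggered an automatic downgrade to 0.6.1.
```

Running a check like this before `pip install` avoids the silent-downgrade failure mode described above.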
...@@ -215,7 +229,7 @@ There are three different ways to experience MinerU:
</tr>
<tr>
    <td colspan="3">Python Version</td>
    <td colspan="3">3.10~3.12</td>
</tr>
<tr>
    <td colspan="3">Nvidia Driver Version</td>
...@@ -225,8 +239,8 @@ There are three different ways to experience MinerU:
</tr>
<tr>
    <td colspan="3">CUDA Environment</td>
    <td>11.8/12.4/12.6</td>
    <td>11.8/12.4/12.6</td>
    <td>None</td>
</tr>
<tr>
...@@ -236,11 +250,11 @@ There are three different ways to experience MinerU:
    <td>None</td>
</tr>
<tr>
    <td rowspan="2">GPU/MPS Hardware Support List</td>
    <td colspan="2">GPU VRAM 6GB or more</td>
    <td colspan="2">All GPUs with Tensor Cores produced from Volta (2017) onwards,<br>
    with more than 6GB of VRAM</td>
    <td rowspan="2">Apple silicon</td>
</tr>
</table>
...@@ -257,9 +271,9 @@ Synced with dev branch updates:

#### 1. Install magic-pdf

```bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
pip install -U "magic-pdf[full]"
```

#### 2. Download model weight files
...@@ -284,7 +298,7 @@ You can modify certain configurations in this file to enable or disable features
{
    // other config
    "layout-config": {
        "model": "doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
...@@ -292,8 +306,8 @@ You can modify certain configurations in this file to enable or disable features
        "enable": true // Formula recognition is enabled by default; to disable it, change this value to false.
    },
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": true, // Table recognition is enabled by default; to disable it, change this value to false.
        "max_time": 400
    }
...@@ -308,7 +322,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
- Quick Deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 6GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration in Docker.
>
...@@ -330,7 +344,7 @@ If your device has NPU acceleration hardware, you can follow the tutorial below

### Using MPS

If your device uses Apple silicon chips, you can enable MPS acceleration for your tasks.

You can enable MPS acceleration by setting the `device-mode` parameter to `mps` in the `magic-pdf.json` configuration file.

...@@ -341,10 +355,6 @@ You can enable MPS acceleration by setting the `device-mode` parameter to `mps`
}
```
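Because `magic-pdf.json` is plain JSON (strict parsers reject the `//` comments shown in the excerpt above, which are illustrative only), settings such as `device-mode` can be flipped with a short script. A sketch, assuming the file sits in the home directory as the docs describe:

```python
import json
import os

def set_config_value(config_path, key, value):
    """Rewrite one top-level key (e.g. 'device-mode') in magic-pdf.json."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    config[key] = value
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4, ensure_ascii=False)

# Example (not run here): enable MPS on an Apple-silicon machine.
# set_config_value(os.path.expanduser("~/magic-pdf.json"), "device-mode", "mps")
```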
## Usage
...@@ -418,6 +428,8 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [RapidOCR](https://github.com/RapidAI/RapidOCR)
- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
......
...@@ -46,6 +46,21 @@
</div>

# Changelog
- 2025/04/03 Version 1.3.0 released, with many changes in this version:
  - Installation and compatibility optimization
    - Replaced the Paddle framework and PaddleOCR throughout the project with paddleocr2torch, resolving conflicts between Paddle and PyTorch.
    - Removed layoutlmv3 from layout detection, resolving compatibility issues caused by `detectron2`.
    - Extended torch version compatibility to 2.2~2.6.
    - Extended CUDA compatibility to 11.8~12.6 (the CUDA version is determined by torch), addressing compatibility issues for some users with 50-series and H-series Nvidia GPUs.
    - Extended Python compatibility to 3.10~3.12, resolving the automatic downgrade to 0.6.1 when installing in non-3.10 environments.
    - Optimized the offline deployment workflow; once deployment succeeds, no model files need to be downloaded from the network.
  - Performance optimization (compared to version 1.0.1, formula parsing speed improved by over 1400%, and overall parsing speed improved by over 500%)
    - Improved parsing speed for batches of small files by supporting batch processing of multiple PDF files ([script example](demo/batch_demo.py)).
    - Optimized the loading and usage of the mfr model, reducing VRAM usage and improving parsing speed (requires re-running the [model download process](docs/how_to_download_models_zh_cn.md) to obtain the incremental model file updates).
    - Optimized VRAM usage; the project can run with as little as 6GB.
    - Improved running speed on MPS devices.
  - Parsing quality optimization
    - Updated the mfr model to unimernet(2503), fixing missing line breaks in multi-line formulas.
- 2025/03/03 1.2.1 released, fixing several issues:
  - Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers
  - Fixed inaccurate caption matching in certain scenarios
...@@ -216,7 +231,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
    <td colspan="3">Python Version</td>
    <td colspan="3">>=3.9,<=3.12</td>
</tr>
<tr>
    <td colspan="3">Nvidia Driver Version</td>
...@@ -226,8 +241,8 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
    <td colspan="3">CUDA Environment</td>
    <td>11.8/12.4/12.6</td>
    <td>11.8/12.4/12.6</td>
    <td>None</td>
</tr>
<tr>
...@@ -237,12 +252,12 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
    <td>None</td>
</tr>
<tr>
    <td rowspan="2">GPU/MPS Hardware Support List</td>
    <td colspan="2">6GB VRAM or more</td>
    <td colspan="2">All GPUs with Tensor Cores produced from Volta (2017) onwards,<br>
    with 6GB of VRAM or more</td>
    <td rowspan="2">Apple silicon</td>
</tr>
</table>
...@@ -262,9 +277,9 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c

> Syncing of the latest version to domestic (Chinese) mirror sources may be delayed; please be patient.

```bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
```

#### 2. Download model weight files
...@@ -288,7 +303,7 @@ pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
{
    // other config
    "layout-config": {
        "model": "doclayout_yolo"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
...@@ -296,8 +311,8 @@ pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
        "enable": true // Formula recognition is enabled by default; to disable it, change this value to false.
    },
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": true, // Table recognition is enabled by default; to disable it, change this value to false.
        "max_time": 400
    }
...@@ -312,7 +327,7 @@ pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
- Quick deployment with Docker
> [!IMPORTANT]
> Docker requires a GPU with at least 6GB of VRAM, and all acceleration features are enabled by default.
>
> Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration in Docker.
>
...@@ -332,7 +347,7 @@ pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple

[NPU Acceleration Tutorial](docs/README_Ascend_NPU_Acceleration_zh_CN.md)

### Using MPS

If your device uses an Apple silicon chip, you can enable MPS acceleration:

You can enable MPS acceleration by setting the `device-mode` parameter to `mps` in the `magic-pdf.json` configuration file.

...@@ -343,10 +358,6 @@ pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
}
```
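The device choice described above follows a simple precedence; a hedged sketch of that logic (the boolean probes stand in for `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, which a real setup script would consult):

```python
def pick_device_mode(cuda_available, mps_available):
    """Choose a `device-mode` value: CUDA first, then Apple-silicon MPS, else CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# An Apple-silicon machine without CUDA resolves to "mps".
```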
## Usage
...@@ -422,6 +433,8 @@ TODO
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [RapidOCR](https://github.com/RapidAI/RapidOCR)
- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
......
import os
from pathlib import Path

from magic_pdf.data.batch_build_dataset import batch_build_dataset
from magic_pdf.tools.common import batch_do_parse


def batch(pdf_dir, output_dir, method, lang):
    os.makedirs(output_dir, exist_ok=True)
    doc_paths = [doc_path for doc_path in Path(pdf_dir).glob('*') if doc_path.suffix == '.pdf']

    # build the datasets with 4 parallel workers
    datasets = batch_build_dataset(doc_paths, 4, lang)
    # os.environ["MINERU_MIN_BATCH_INFERENCE_SIZE"] = "200"  # parse every 200 pages as one batch
    batch_do_parse(output_dir, [str(doc_path.stem) for doc_path in doc_paths], datasets, method)


if __name__ == '__main__':
    batch("pdfs", "output", "auto", "")
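The commented `MINERU_MIN_BATCH_INFERENCE_SIZE` variable above controls how many pages go into one inference batch. The packing itself can be sketched standalone (the function name and page counts are illustrative, not part of the magic-pdf API):

```python
def chunk_pages(page_counts, batch_size=200):
    """Greedily pack documents into batches of roughly `batch_size` pages."""
    batches, current, current_pages = [], [], 0
    for doc_id, pages in enumerate(page_counts):
        # flush the current batch once adding this document would overflow it
        if current and current_pages + pages > batch_size:
            batches.append(current)
            current, current_pages = [], 0
        current.append(doc_id)
        current_pages += pages
    if current:
        batches.append(current)
    return batches

# Ten 30-page PDFs with batch_size=200 pack into two batches: 6 docs + 4 docs.
```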
...@@ -7,18 +7,17 @@ from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

# args
__dir__ = os.path.dirname(os.path.abspath(__file__))
pdf_file_name = os.path.join(__dir__, "pdfs", "demo1.pdf")  # replace with the real pdf path
name_without_extension = os.path.basename(pdf_file_name).split('.')[0]

# prepare env
local_image_dir = os.path.join(__dir__, "output", name_without_extension, "images")
local_md_dir = os.path.join(__dir__, "output", name_without_extension)
image_dir = str(os.path.basename(local_image_dir))

os.makedirs(local_image_dir, exist_ok=True)

image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)

# read bytes
reader1 = FileBasedDataReader("")
...@@ -41,32 +40,29 @@ else:

## pipeline
pipe_result = infer_result.pipe_txt_mode(image_writer)

### get model inference result
model_inference_result = infer_result.get_infer_res()

### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_extension}_layout.pdf"))

### draw spans result on each page
pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_extension}_spans.pdf"))

### get markdown content
md_content = pipe_result.get_markdown(image_dir)

### dump markdown
pipe_result.dump_md(md_writer, f"{name_without_extension}.md", image_dir)

### get content list content
content_list_content = pipe_result.get_content_list(image_dir)

### dump content list
pipe_result.dump_content_list(md_writer, f"{name_without_extension}_content_list.json", image_dir)

### get middle json
middle_json_content = pipe_result.get_middle_json()

### dump middle json
pipe_result.dump_middle_json(md_writer, f'{name_without_extension}_middle.json')
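The script above writes all of its artifacts under `output/<pdf name>/`. That path derivation can be shown on its own (stdlib only; the helper name is ours, and the artifact names mirror the script):

```python
import os

def output_paths(pdf_path, output_root="output"):
    """Map a PDF path to the artifact paths the demo script produces."""
    name = os.path.basename(pdf_path).split(".")[0]
    base = os.path.join(output_root, name)
    return {
        "markdown": os.path.join(base, f"{name}.md"),
        "layout_pdf": os.path.join(base, f"{name}_layout.pdf"),
        "spans_pdf": os.path.join(base, f"{name}_spans.pdf"),
        "content_list": os.path.join(base, f"{name}_content_list.json"),
        "middle_json": os.path.join(base, f"{name}_middle.json"),
        "images_dir": os.path.join(base, "images"),
    }
```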
...@@ -34,10 +34,9 @@ RUN python3 -m venv /opt/mineru_venv
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple && \
    wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/ascend_npu/requirements.txt -O requirements.txt && \
    pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple && \
    wget https://gitee.com/ascend/pytorch/releases/download/v6.0.rc2-pytorch2.3.1/torch_npu-2.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl && \
    pip3 install torch_npu-2.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl"

# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/magic-pdf.template.json && \
......
boto3>=1.28.43
Brotli>=1.1.0
click>=8.1.7
PyMuPDF>=1.24.9,<1.25.0
loguru>=0.6.0
numpy>=1.21.6,<2.0.0
fast-langdetect>=0.2.3,<0.3.0
scikit-learn>=1.0.2
pdfminer.six==20231228
torch==2.3.1
torchvision==0.18.1
matplotlib
ultralytics>=8.3.48
rapid-table>=1.0.3,<2.0.0
doclayout-yolo==0.0.2b1
openai
pydantic>=2.7.2,<2.11
transformers>=4.49.0,<5.0.0
tqdm>=4.67.1
\ No newline at end of file
...@@ -31,8 +31,7 @@ RUN python3 -m venv /opt/mineru_venv
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple && \
    wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/requirements.txt -O requirements.txt && \
    pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple"

# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/magic-pdf.template.json && \
......
boto3>=1.28.43
Brotli>=1.1.0
click>=8.1.7
PyMuPDF>=1.24.9,<1.25.0
loguru>=0.6.0
numpy>=1.21.6,<2.0.0
fast-langdetect>=0.2.3,<0.3.0
scikit-learn>=1.0.2
pdfminer.six==20231228
torch>=2.2.2,!=2.5.0,!=2.5.1,<=2.6.0
torchvision
matplotlib
ultralytics>=8.3.48
rapid-table>=1.0.3,<2.0.0
doclayout-yolo==0.0.2b1
openai
pydantic>=2.7.2,<2.11
transformers>=4.49.0,<5.0.0
tqdm>=4.67.1
\ No newline at end of file
...@@ -31,8 +31,7 @@ RUN python3 -m venv /opt/mineru_venv
RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip && \
    wget https://github.com/opendatalab/MinerU/raw/master/docker/global/requirements.txt -O requirements.txt && \
    pip3 install -r requirements.txt"

# Copy the configuration file template and install magic-pdf latest
RUN /bin/bash -c "wget https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json && \
......
boto3>=1.28.43
Brotli>=1.1.0
click>=8.1.7
PyMuPDF>=1.24.9,<1.25.0
loguru>=0.6.0
numpy>=1.21.6,<2.0.0
fast-langdetect>=0.2.3,<0.3.0
scikit-learn>=1.0.2
pdfminer.six==20231228
torch>=2.2.2,!=2.5.0,!=2.5.1,<=2.6.0
torchvision
matplotlib
ultralytics>=8.3.48
rapid-table>=1.0.3,<2.0.0
doclayout-yolo==0.0.2b1
openai
pydantic>=2.7.2,<2.11
transformers>=4.49.0,<5.0.0
tqdm>=4.67.1
\ No newline at end of file
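The torch pin above (`torch>=2.2.2,!=2.5.0,!=2.5.1,<=2.6.0`) excludes the 2.5.x releases while allowing everything else in the 2.2–2.6 range. A toy evaluator for this comma-separated subset of pip's specifier grammar (real tooling should use the `packaging` library instead; `satisfies` is our name):

```python
import operator

# Two-character operators must be tried before one-character ones.
OPS = {">=": operator.ge, "<=": operator.le, "!=": operator.ne,
       "==": operator.eq, ">": operator.gt, "<": operator.lt}

def parse_version(text):
    return tuple(int(part) for part in text.split("."))

def satisfies(version, specifier):
    """Check a version string (e.g. '2.5.1') against a comma-separated specifier."""
    v = parse_version(version)
    for clause in specifier.split(","):
        clause = clause.strip()
        for op in sorted(OPS, key=len, reverse=True):
            if clause.startswith(op):
                if not OPS[op](v, parse_version(clause[len(op):])):
                    return False
                break
    return True

# torch 2.5.1 is excluded by the pin above; 2.6.0 is allowed.
```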
...@@ -9,11 +9,11 @@ nvidia-smi
If you see information similar to the following, it means that the NVIDIA drivers are already installed, and you can skip Step 2.
> [!NOTE]
> `CUDA Version` should be >= 12.4. If the displayed version number is less than 12.4, please upgrade the driver.

```plaintext
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83        CUDA Version: 12.8    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 TCC/WDDM      | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf           Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
...@@ -31,7 +31,7 @@ If no driver is installed, use the following command:

```sh
sudo apt-get update
sudo apt-get install nvidia-driver-570-server
```

Install the proprietary driver and restart your computer after installation.
...@@ -53,17 +53,15 @@ In the final step, enter `yes`, close the terminal, and reopen it.

### 4. Create an Environment Using Conda

```bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
```

### 5. Install Applications

```sh
pip install -U magic-pdf[full]
```

> [!IMPORTANT]
> After installation, make sure to check the version of `magic-pdf` using the following command:
...@@ -72,7 +70,7 @@ pip install -U magic-pdf[full]
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report the issue.
### 6. Download Models
...@@ -94,13 +92,13 @@ You can find the `magic-pdf.json` file in your user directory.
Download a sample file from the repository and test it.

```sh
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
```
### 9. Test CUDA Acceleration

If your graphics card has at least **6GB** of VRAM, follow these steps to test CUDA acceleration:

1. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file located in your home directory.
```json
...@@ -111,15 +109,4 @@ If your graphics card has at least **8GB** of VRAM, follow these steps to test C
2. Test CUDA acceleration with the following command:
```sh
magic-pdf -p small_ocr.pdf -o ./output
```
\ No newline at end of file
...@@ -9,11 +9,11 @@ nvidia-smi ...@@ -9,11 +9,11 @@ nvidia-smi
If you see output similar to the following, the NVIDIA driver is already installed and you can skip step 2.

> [!NOTE]
> The version shown for `CUDA Version` should be >= 12.4; if the version shown is lower than 12.4, please upgrade your driver.

```plaintext
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8   |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
```
```bash
sudo apt-get update
sudo apt-get install nvidia-driver-570-server
```

This installs the proprietary driver; reboot the machine once installation completes.
## 4. Create an Environment Using conda

```bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
```

## 5. Install the Application

```bash
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
```
> [!IMPORTANT]
> After installation, verify the version with:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it to us in the issues.
## 6. Download Models

Download a sample file from the repository and test it.

```bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/pdfs/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
```
## 9. Test CUDA Acceleration

If your graphics card has at least **6GB** of VRAM, you can run the following steps to test CUDA-accelerated parsing.

**1. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file in your user directory**

**2. Run the following command to test CUDA acceleration**

```bash
magic-pdf -p small_ocr.pdf -o ./output
```
> [!TIP]
> You can roughly judge whether CUDA acceleration is working from the per-stage cost times in the log; normally, running with CUDA acceleration is faster than on CPU.
### 1. Install CUDA and cuDNN

You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.

- CUDA 11.8: https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4: https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6: https://developer.nvidia.com/cuda-12-6-0-download-archive
### 2. Install Anaconda
### 3. Create an Environment Using Conda

```bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
```
### 4. Install Applications

```
pip install -U magic-pdf[full]
```
> [!IMPORTANT]
> After installation, verify the version with:
>
> ```
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it in the issues section.
### 5. Download Models

You can find the `magic-pdf.json` file in your user directory.

Download a sample file from the repository and test it.

```powershell
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
```
### 8. Test CUDA Acceleration

If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-accelerated parsing performance.

1. **Overwrite the installation of torch and torchvision** with CUDA support. (Select the appropriate index-url for your CUDA version; for details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)

```
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
```
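The `--index-url` suffix simply tracks the CUDA version (`cu118`, `cu124`, `cu126`). As an illustration only — the `torch_index_url` helper and its table below are hypothetical, not part of magic-pdf or torch — the mapping can be sketched as:

```python
# Hypothetical helper: map an installed CUDA version to the matching
# PyTorch wheel index URL. The table mirrors the CUDA versions that
# torch currently ships wheels for (11.8 / 12.4 / 12.6).
CUDA_WHEEL_TAGS = {'11.8': 'cu118', '12.4': 'cu124', '12.6': 'cu126'}


def torch_index_url(cuda_version):
    tag = CUDA_WHEEL_TAGS.get(cuda_version)
    if tag is None:
        raise ValueError(f'no prebuilt torch wheels for CUDA {cuda_version}')
    return f'https://download.pytorch.org/whl/{tag}'


print(torch_index_url('12.4'))  # → https://download.pytorch.org/whl/cu124
```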
2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.

3. **Run the following command to test CUDA acceleration**:

```
magic-pdf -p small_ocr.pdf -o ./output
```
## 1. Install CUDA and cuDNN

Install a CUDA version that meets torch's requirements; torch currently supports CUDA 11.8/12.4/12.6.

- CUDA 11.8: https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4: https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6: https://developer.nvidia.com/cuda-12-6-0-download-archive
## 2. Install Anaconda
## 3. Create an Environment Using conda

```bash
conda create -n mineru 'python<3.13' -y
conda activate mineru
```
## 4. Install the Application

```bash
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
```
> [!IMPORTANT]
> After installation, verify the version with:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it to us in the issues.
## 5. Download Models

Download a sample file from the repository and test it.

```powershell
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
```
## 8. Test CUDA Acceleration

If your graphics card has at least **6GB** of VRAM, you can run the following steps to test CUDA-accelerated parsing.

**1. Overwrite the installation of torch and torchvision with CUDA support** (select the appropriate index-url for your CUDA version; for details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/))

```bash
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
```

**2. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file in your user directory**

**3. Run the following command to test CUDA acceleration**

```bash
magic-pdf -p small_ocr.pdf -o ./output
```
> [!TIP]
> You can roughly judge whether CUDA acceleration is working from the per-stage times in the log; normally, running with CUDA acceleration is faster than on CPU.
"enable": false "enable": false
} }
}, },
"config_version": "1.1.1" "config_version": "1.2.0"
} }
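For reference, a minimal `magic-pdf.json` shape consistent with the fragment above — everything other than `"config_version"` and the nested `"enable"` flag is illustrative, so check the file actually generated in your user directory rather than copying this verbatim:

```json
{
  "device-mode": "cpu",
  "table-config": {
    "enable": false
  },
  "config_version": "1.2.0"
}
```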
import concurrent.futures

import fitz  # PyMuPDF

from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.data.utils import fitz_doc_to_image
def partition_array_greedy(arr, k):
    """Partition an array of jobs into k parts using a simple greedy approach.

    Parameters:
    -----------
    arr : list of tuple
        The input jobs as (pdf_path, page_count) tuples; partitions are
        balanced on the page count (the second element)
    k : int
        Number of partitions to create

    Returns:
    --------
    partitions : list of lists
        The k partitions, each holding indices into arr
    """
    # Handle edge cases
    if k <= 0:
        raise ValueError('k must be a positive integer')
    if k > len(arr):
        k = len(arr)  # Adjust k if it's too large
    if k == 1:
        return [list(range(len(arr)))]
    if k == len(arr):
        return [[i] for i in range(len(arr))]

    # Sort the job indices by page count in descending order
    sorted_indices = sorted(range(len(arr)), key=lambda i: arr[i][1], reverse=True)

    # Initialize k empty partitions
    partitions = [[] for _ in range(k)]
    partition_sums = [0] * k

    # Assign each job to the partition with the smallest current sum
    for idx in sorted_indices:
        # Find the partition with the smallest sum
        min_sum_idx = partition_sums.index(min(partition_sums))

        # Add the job to this partition
        partitions[min_sum_idx].append(idx)  # Store the original index
        partition_sums[min_sum_idx] += arr[idx][1]

    return partitions
def process_pdf_batch(pdf_jobs, idx):
    """Render every page of a batch of PDFs to images.

    Parameters:
    -----------
    pdf_jobs : list of tuples
        List of (pdf_path, page_count) tuples
    idx : int
        Index of this partition, returned alongside the images so the
        caller can reassemble results in order

    Returns:
    --------
    (idx, images) : tuple
        The partition index and, for each PDF, the list of rendered page images
    """
    images = []

    for pdf_path, _ in pdf_jobs:
        doc = fitz.open(pdf_path)
        tmp = []
        for page_num in range(len(doc)):
            page = doc[page_num]
            tmp.append(fitz_doc_to_image(page))
        images.append(tmp)
    return (idx, images)
def batch_build_dataset(pdf_paths, k, lang=None):
    """Process multiple PDFs by partitioning them into k balanced parts and
    rendering each part in a separate worker process.

    Parameters:
    -----------
    pdf_paths : list
        List of paths to PDF files
    k : int
        Number of partitions (worker processes) to create
    lang : str or None
        Language hint passed through to PymuDocDataset

    Returns:
    --------
    results : list
        One PymuDocDataset per input PDF, in the original input order
    """
    # Get page counts for each PDF
    pdf_info = []
    total_pages = 0

    for pdf_path in pdf_paths:
        try:
            doc = fitz.open(pdf_path)
            num_pages = len(doc)
            pdf_info.append((pdf_path, num_pages))
            total_pages += num_pages
            doc.close()
        except Exception as e:
            print(f'Error opening {pdf_path}: {e}')

    # Partition the PDFs into k parts balanced by page count
    partitions = partition_array_greedy(pdf_info, k)

    # Process each partition in parallel
    all_images_h = {}

    with concurrent.futures.ProcessPoolExecutor(max_workers=k) as executor:
        # Submit one task per partition
        futures = []
        for sn, partition in enumerate(partitions):
            # Get the jobs for this partition
            partition_jobs = [pdf_info[idx] for idx in partition]

            # Submit the task
            future = executor.submit(
                process_pdf_batch,
                partition_jobs,
                sn
            )
            futures.append(future)

        # Collect results as they complete
        for i, future in enumerate(concurrent.futures.as_completed(futures)):
            try:
                idx, images = future.result()
                all_images_h[idx] = images
            except Exception as e:
                print(f'Error processing partition: {e}')

    # Rebuild one dataset per PDF, restoring the original input order
    results = [None] * len(pdf_paths)
    for i in range(len(partitions)):
        partition = partitions[i]
        for j in range(len(partition)):
            with open(pdf_info[partition[j]][0], 'rb') as f:
                pdf_bytes = f.read()
            dataset = PymuDocDataset(pdf_bytes, lang=lang)
            dataset.set_images(all_images_h[i][j])
            results[partition[j]] = dataset
    return results
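The greedy balancing that `batch_build_dataset` relies on can be exercised on its own. The sketch below restates the algorithm without the edge-case guards so it runs standalone (the file names are made up; only the page counts matter):

```python
# Standalone sketch of the greedy balancing used by batch_build_dataset:
# sort jobs by page count (descending), then always hand the next job to
# the partition with the smallest running page total. Edge-case guards
# (k <= 0, k >= len(jobs)) are omitted here for brevity.
def greedy_partition(jobs, k):
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][1], reverse=True)
    parts = [[] for _ in range(k)]
    sums = [0] * k
    for idx in order:
        target = sums.index(min(sums))  # least-loaded partition so far
        parts[target].append(idx)
        sums[target] += jobs[idx][1]
    return parts


# Hypothetical file names; two workers end up with 110 and 100 pages.
pdf_info = [('a.pdf', 100), ('b.pdf', 30), ('c.pdf', 70), ('d.pdf', 10)]
print(greedy_partition(pdf_info, 2))  # → [[0, 3], [2, 1]]
```

Worker outputs are keyed by partition index, which is how `batch_build_dataset` maps the rendered images back onto the original input order afterwards.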