"vscode:/vscode.git/clone" did not exist on "ecde075d49286f039deb712439c694b01e38bb6a"
Commit 8957cf17, authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #2941 from opendatalab/docs_update

Docs update
parents 720ca126 18b754aa
......@@ -29,7 +29,7 @@ jobs:
path-to-document: 'https://github.com/opendatalab/MinerU/blob/master/MinerU_CLA.md' # e.g. a CLA or a DCO document
# branch should not be protected
branch: 'master'
allowlist: myhloli,dt-yy,Focusshang,renpengli01,icecraft,drunkpig,wangbinDL,qiangqiang199,GDDGCZ518,papayalove,conghui,quyuan,LollipopsAndWine
allowlist: myhloli,dt-yy,Focusshang,renpengli01,icecraft,drunkpig,wangbinDL,qiangqiang199,GDDGCZ518,papayalove,conghui,quyuan,LollipopsAndWine,Sidney233
# the followings are the optional inputs - If the optional inputs are not given, then default values will be taken
#remote-organization-name: enter the remote organization name where the signatures should be stored (Default is storing the signatures in the same repository)
......
name: Publish docs via GitHub Pages
on:
push:
branches:
- "master"
- "dev"
jobs:
build:
name: Deploy docs
runs-on: ubuntu-latest
steps:
- name: Checkout master
uses: actions/checkout@v4
with:
ref: dev
- name: Deploy docs
uses: mhausenblas/mkdocs-deploy-gh-pages@master
# Or use mhausenblas/mkdocs-deploy-gh-pages@nomaterial to build without the mkdocs-material theme
env:
PERSONAL_TOKEN: ${{ secrets.RELEASE_TOKEN }}
REQUIREMENTS: /docs/requirements.txt
# Frequently Asked Questions
### 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
## 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
The `libgl` library is missing in Ubuntu 22.04 on WSL2. You can install the `libgl` library with the following command to resolve the issue:
......@@ -11,7 +11,7 @@ sudo apt-get install libgl1-mesa-glx
Reference: https://github.com/opendatalab/MinerU/issues/388
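Before reinstalling anything, it can help to confirm that the library is really missing. The following is a minimal sketch, not part of MinerU: the function name `libgl_available` is made up for illustration, and it relies only on the standard library's `ctypes.util.find_library`, which asks the system linker tools whether a `GL` library can be located.

```python
import ctypes.util


def libgl_available() -> bool:
    """Return True if a libGL shared library can be located on this system."""
    # find_library returns a name such as "libGL.so.1", or None when nothing is found.
    return ctypes.util.find_library("GL") is not None


if __name__ == "__main__":
    if libgl_available():
        print("libGL was found")
    else:
        print("libGL was not found; install it as described in this FAQ entry")
```

A negative result here is only a hint, since `find_library` depends on tools such as `ldconfig` being available.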
### 2. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
## 2. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
The new version of albumentations (1.4.21) introduces a dependency on simsimd. Since the pre-built package of simsimd for Linux requires a glibc version greater than or equal to 2.28, this causes installation issues on some Linux distributions released before 2019. You can resolve this issue by using the following command:
```
......
<div align="center" xmlns="http://www.w3.org/1999/html">
<!-- logo -->
<p align="center">
<img src="images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>
</div>
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/)
[![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru)
[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMTM0IiBoZWlnaHQ9IjEzNCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj48cGF0aCBkPSJtMTIyLDljMCw1LTQsOS05LDlzLTktNC05LTksNC05LDktOSw5LDQsOSw5eiIgZmlsbD0idXJsKCNhKSIvPjxwYXRoIGQ9Im0xMjIsOWMwLDUtNCw5LTksOXMtOS00LTktOSw0LTksOS05LDksNCw5LDl6IiBmaWxsPSIjMDEwMTAxIi8+PHBhdGggZD0ibTkxLDE4YzAsNS00LDktOSw5cy05LTQtOS05LDQtOSw5LTksOSw0LDksOXoiIGZpbGw9InVybCgjYikiLz48cGF0aCBkPSJtOTEsMThjMCw1LTQsOS05LDlzLTktNC05LTksNC05LDktOSw5LDQsOSw5eiIgZmlsbD0iIzAxMDEwMSIvPjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJtMzksNjJjMCwxNiw4LDMwLDIwLDM4LDctNiwxMi0xNiwxMi0yNlY0OWMwLTQsMy03LDYtOGw0Ni0xMmM1LTEsMTEsMywxMSw4djMxYzAsMzctMzAsNjYtNjYsNjYtMzcsMC02Ni0zMC02Ni02NlY0NmMwLTQsMy03LDYtOGwyMC02YzUtMSwxMSwzLDExLDh2MjF6bS0yOSw2YzAsMTYsNiwzMCwxNyw0MCwzLDEsNSwxLDgsMSw1LDAsMTAtMSwxNS0zQzM3LDk1LDI5LDc5LDI5LDYyVjQybC0xOSw1djIweiIgZmlsbD0idXJsKCNjKSIvPjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJtMzksNjJjMCwxNiw4LDMwLDIwLDM4LDctNiwxMi0xNiwxMi0yNlY0OWMwLTQsMy03LDYtOGw0Ni0xMmM1LTEsMTEsMywxMSw4djMxYzAsMzctMzAsNjYtNjYsNjYtMzcsMC02Ni0zMC02Ni02NlY0NmMwLTQsMy03LDYtOGwyMC02YzUtMSwxMSwzLDExLDh2MjF6bS0yOSw2YzAsMTYsNiwzMCwxNyw0MCwzLDEsNSwxLDgsMSw1LDAsMTAtMSwxNS0zQzM3LDk1LDI5LDc5LDI5LDYyVjQybC0xOSw1djIweiIgZmlsbD0iIzAxMDEwMSIvPjxkZWZzPjxsaW5lYXJHcmFkaWVudCBpZD0iYSIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJlMmUiLz48L2xpbmVhckdyYWRpZW50PjxsaW5lYXJHcmFkaWVudCBpZD0iYiIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJlMmUiLz48L2xpbmVhckdyYWRpZW50PjxsaW5lYXJHcmFkaWVudCBpZD0iYyIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJ
lMmUiLz48L2xpbmVhckdyYWRpZW50PjwvZGVmcz48L3N2Zz4=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzj
I7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU)
<div align="center" xmlns="http://www.w3.org/1999/html">
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
<br>
<br>
🚀<a href="https://mineru.net/?source=github">Access MinerU Now→✅ Zero-Install Web Version ✅ Full-Featured Desktop Client ✅ Instant API Access; Skip deployment headaches – get all product formats in one click. Developers, dive in!</a>
</p>
<!-- join us -->
<p align="center">
👋 Join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="http://mineru.space/s/V85Yl" target="_blank">WeChat</a>
</p>
</div>
## Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared to well-known commercial products, MinerU is still young. If you encounter any issues or the results are not as expected, please open an issue on the [issue tracker](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
![type:video](https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c)
## Key Features
- Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence.
- Output text in human-readable order, suitable for single-column, multi-column, and complex layouts.
- Preserve the structure of the original document, including headings, paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles, and footnotes.
- Automatically recognize and convert formulas in the document to LaTeX format.
- Automatically recognize and convert tables in the document to HTML format.
- Automatically detect scanned PDFs and garbled PDFs and enable OCR functionality.
- OCR supports detection and recognition of 84 languages.
- Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and rich intermediate formats.
- Supports various visualization results, including layout visualization and span visualization, for efficient confirmation of output quality.
- Supports running in a pure CPU environment, and also supports GPU (CUDA), NPU (CANN), and MPS acceleration.
- Compatible with Windows, Linux, and Mac platforms.
\ No newline at end of file
# Known Issues
- Reading order is determined by the model based on the spatial distribution of readable content, and may be out of order in some areas under extremely complex layouts.
- Limited support for vertical text.
- Tables of contents and lists are recognized through rules, and some uncommon list formats may not be recognized.
- Code blocks are not yet supported in the layout model.
- Comic books, art albums, primary school textbooks, and exercises cannot be parsed well.
- Table recognition may result in row/column recognition errors in complex tables.
- OCR recognition may produce inaccurate characters in PDFs of lesser-known languages (e.g., diacritical marks in Latin script, easily confused characters in Arabic script).
- Some formulas may not render correctly in Markdown.
## Overview
# Overview
In addition to the Markdown output, the `mineru` command generates several auxiliary files. Each of them is described below.
### some_pdf_layout.pdf
## some_pdf_layout.pdf
Each page's layout consists of one or more bounding boxes. The number in the top-right corner of each box indicates the reading order. Additionally, different content blocks are highlighted with distinct background colors within the layout.pdf.
![layout example](images/layout_example.png)
![layout example](../images/layout_example.png)
### some_pdf_spans.pdf(Applicable only to the pipeline backend)
## some_pdf_spans.pdf(Applicable only to the pipeline backend)
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
![spans example](images/spans_example.png)
![spans example](../images/spans_example.png)
### some_pdf_model.json(Applicable only to the pipeline backend)
## some_pdf_model.json(Applicable only to the pipeline backend)
#### Structure Definition
### Structure Definition
```python
from pydantic import BaseModel, Field
......@@ -61,9 +61,9 @@ inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
![Poly Coordinate Diagram](images/poly.png)
![Poly Coordinate Diagram](../images/poly.png)
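The poly layout described above can be unpacked with a few lines of Python. This is an illustrative sketch only: the helper names `poly_to_points` and `poly_to_bbox` are not part of MinerU, and the sample coordinates are made up.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def poly_to_points(poly: List[float]) -> List[Point]:
    """Split [x0, y0, x1, y1, x2, y2, x3, y3] into four (x, y) corner points.

    The order follows the description above: top-left, top-right,
    bottom-right, bottom-left.
    """
    if len(poly) != 8:
        raise ValueError("poly must contain exactly 8 numbers")
    return [(poly[i], poly[i + 1]) for i in range(0, 8, 2)]


def poly_to_bbox(poly: List[float]) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing the poly."""
    points = poly_to_points(poly)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    sample = [10, 20, 110, 20, 110, 70, 10, 70]  # made-up rectangle
    print(poly_to_points(sample))
    print(poly_to_bbox(sample))
```

Taking the minimum and maximum over all four corners keeps the bounding box correct even if a poly is slightly rotated.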
#### example
### example
```json
[
......@@ -116,7 +116,7 @@ The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], repres
]
```
### some_pdf_model_output.txt (Applicable only to the VLM backend)
## some_pdf_model_output.txt (Applicable only to the VLM backend)
This file contains the output of the VLM model, with each page's output separated by `----`.
Each page's output consists of text blocks starting with `<|box_start|>` and ending with `<|md_end|>`.
......@@ -142,7 +142,7 @@ The meaning of each field is as follows:
This field contains the Markdown content of the block. If `type` is `text`, the end of the text may contain the `<|txt_contd|>` tag, indicating that this block can be connected with the following `text` block(s).
If `type` is `table`, the content is in `otsl` format and needs to be converted into HTML for rendering in Markdown.
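As a rough illustration of the file layout described above, the sketch below splits the raw text into pages and pulls out the body of each block. It makes several assumptions that the documentation does not confirm: that the `----` separator sits on a line of its own, that blocks are not nested, and that the function names are free to choose (they are not MinerU APIs). The block body is returned unparsed, because the inner fields are not modelled here.

```python
import re
from typing import List

PAGE_SEPARATOR = "----"
# Everything between <|box_start|> and <|md_end|>, across line breaks.
BLOCK_PATTERN = re.compile(r"<\|box_start\|>(.*?)<\|md_end\|>", re.DOTALL)
CONTINUATION_TAG = "<|txt_contd|>"


def split_pages(raw: str) -> List[str]:
    """Split the file content into per-page strings (assumes the separator is a full line)."""
    pages: List[str] = []
    current: List[str] = []
    for line in raw.splitlines():
        if line.strip() == PAGE_SEPARATOR:
            pages.append("\n".join(current))
            current = []
        else:
            current.append(line)
    pages.append("\n".join(current))
    # Drop empty pages, e.g. the one produced by a trailing separator.
    return [page for page in pages if page.strip()]


def extract_blocks(page: str) -> List[str]:
    """Return the raw body of every block on one page."""
    return [body.strip() for body in BLOCK_PATTERN.findall(page)]


def is_continued(block_body: str) -> bool:
    """True if the block ends with the <|txt_contd|> tag."""
    return block_body.rstrip().endswith(CONTINUATION_TAG)
```

A block for which `is_continued` returns true would typically be joined with the next `text` block before rendering.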
### some_pdf_middle.json
## some_pdf_middle.json
| Field Name | Description |
|:---------------| :------------------------------------------------------------------------------------------------------------- |
......@@ -251,7 +251,7 @@ The block structure is as follows:
First-level block (if any) -> Second-level block -> Line -> Span
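The nesting above can be walked recursively. The sketch below is hypothetical in two respects: the function `iter_spans` is not part of MinerU, and the key names `blocks`, `lines`, and `spans` are assumed for illustration, so check them against your own `middle.json` before relying on them.

```python
from typing import Any, Dict, Iterator


def iter_spans(block: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every span under a block, in document order.

    Assumed layout: a first-level block keeps second-level blocks under
    "blocks"; a second-level block keeps "lines"; each line keeps "spans".
    """
    # First-level block: descend into its second-level blocks.
    for child in block.get("blocks", []):
        yield from iter_spans(child)
    # Second-level block: walk its lines and their spans.
    for line in block.get("lines", []):
        yield from line.get("spans", [])
```

Because a first-level block is optional, the same function works when it is called directly on a second-level block.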
#### example
### example
```json
{
......@@ -355,7 +355,7 @@ First-level block (if any) -> Second-level block -> Line -> Span
```
### some_pdf_content_list.json
## some_pdf_content_list.json
This file is a JSON array where each element is a dict storing all readable content blocks in the document in reading order.
`content_list` can be viewed as a simplified version of `middle.json`. The content block types are mostly consistent with those in `middle.json`, but layout information is not included.
......@@ -376,7 +376,7 @@ Please note that both `title` and text blocks in `content_list` are uniformly re
Each content contains the `page_idx` field, indicating the page number (starting from 0) where the content block resides.
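Since every element carries `page_idx`, regrouping the flat array by page is straightforward. The following is a minimal sketch rather than a MinerU API: the function names are illustrative, and the `text` key in the usage example is a placeholder for whatever fields your content blocks contain.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List


def load_content_list(path: Path) -> List[Dict[str, Any]]:
    """Read a content_list JSON file (a JSON array of dicts)."""
    return json.loads(path.read_text(encoding="utf-8"))


def group_by_page(content_list: List[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Group content blocks by their 0-based page_idx, keeping reading order within a page."""
    pages: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for item in content_list:
        pages[item["page_idx"]].append(item)
    return dict(pages)


if __name__ == "__main__":
    sample = [
        {"page_idx": 0, "text": "Title"},
        {"page_idx": 0, "text": "First paragraph"},
        {"page_idx": 1, "text": "Second page"},
    ]
    for page_idx, items in group_by_page(sample).items():
        # page_idx starts from 0, so add 1 for a human-readable page number.
        print(f"page {page_idx + 1}: {len(items)} block(s)")
```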
#### example
### example
```json
[
......
# Quick Start
If you encounter any installation issues, please first consult the [FAQ](../FAQ/index.md).
If the parsing results are not as expected, refer to the [Known Issues](../known_issues.md).
There are two ways to experience MinerU:
- [Online Demo](online_demo.md)
- [Local Deployment](local_deployment.md)
> [!WARNING]
> **Pre-installation Notice—Hardware and Software Environment Support**
>
> To ensure the stability and reliability of the project, we only optimize and test for specific hardware and software environments during development. This ensures that users deploying and running the project on recommended system configurations will get the best performance with the fewest compatibility issues.
>
> By focusing resources on the mainline environment, our team can more efficiently resolve potential bugs and develop new features.
>
> In non-mainline environments, due to the diversity of hardware and software configurations, as well as third-party dependency compatibility issues, we cannot guarantee 100% project availability. Therefore, for users who wish to use this project in non-recommended environments, we suggest carefully reading the documentation and FAQ first. Most issues already have corresponding solutions in the FAQ. We also encourage community feedback to help us gradually expand support.
<table>
<tr>
<td>Parsing Backend</td>
<td>pipeline</td>
<td>vlm-transformers</td>
<td>vlm-sglang</td>
</tr>
<tr>
<td>Operating System</td>
<td>windows/linux/mac</td>
<td>windows/linux</td>
<td>windows(wsl2)/linux</td>
</tr>
<tr>
<td>CPU Inference Support</td>
<td></td>
<td colspan="2"></td>
</tr>
<tr>
<td>GPU Requirements</td>
<td>Turing architecture or later, 6GB+ VRAM or Apple Silicon</td>
<td colspan="2">Ampere architecture or later, 8GB+ VRAM</td>
</tr>
<tr>
<td>Memory Requirements</td>
<td colspan="3">Minimum 16GB+, 32GB+ recommended</td>
</tr>
<tr>
<td>Disk Space Requirements</td>
<td colspan="3">20GB+, SSD recommended</td>
</tr>
<tr>
<td>Python Version</td>
<td colspan="3">3.10-3.13</td>
</tr>
</table>
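The Python requirement in the table can be checked before installing. The snippet below is a small sketch under stated assumptions: the function name is illustrative and not part of MinerU, and the bounds simply restate the 3.10-3.13 range listed above.

```python
import sys
from typing import Optional, Sequence

MIN_PYTHON = (3, 10)  # lowest version listed in the table
MAX_PYTHON = (3, 13)  # highest version listed in the table


def python_version_supported(version: Optional[Sequence[int]] = None) -> bool:
    """Return True if the (major, minor) version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if __name__ == "__main__":
    status = "supported" if python_version_supported() else "outside the tested range"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```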
\ No newline at end of file
# Local Deployment
## Install MinerU
### Install via pip or uv
```bash
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[core]"
```
### Install from source
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
uv pip install -e .[core]
```
> [!NOTE]
> Linux and macOS systems automatically support CUDA/MPS acceleration after installation. For Windows users who want to use CUDA acceleration,
> please visit the [PyTorch official website](https://pytorch.org/get-started/locally/) to install PyTorch with the appropriate CUDA version.
### Install Full Version (supports sglang acceleration) (requires device with Turing or newer architecture and at least 8GB GPU memory)
If you need to use **sglang to accelerate VLM model inference**, you can choose any of the following methods to install the full version:
- Install using uv or pip:
```bash
uv pip install -U "mineru[all]"
```
- Install from source:
```bash
uv pip install -e .[all]
```
> [!TIP]
> If any exceptions occur during the installation of `sglang`, please refer to the [official sglang documentation](https://docs.sglang.ai/start/install.html) for troubleshooting and solutions, or directly use Docker-based installation.
- Build image using Dockerfile:
```bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/global/Dockerfile
docker build -t mineru-sglang:latest -f Dockerfile .
```
Start Docker container:
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
--ipc=host \
mineru-sglang:latest \
mineru-sglang-server --host 0.0.0.0 --port 30000
```
Or start using Docker Compose:
```bash
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/compose.yaml
docker compose -f compose.yaml up -d
```
> [!TIP]
> The Dockerfile uses `lmsysorg/sglang:v0.4.8.post1-cu126` as the default base image, which supports the Turing/Ampere/Ada Lovelace/Hopper platforms.
> If you are using the newer Blackwell platform, please change the base image to `lmsysorg/sglang:v0.4.8.post1-cu128-b200`.
### Install client (for connecting to sglang-server on edge devices that require only CPU and network connectivity)
```bash
uv pip install -U mineru
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://<host_ip>:<port>
```
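When the client is driven from a script, the same command can be assembled programmatically. This is a sketch only: `build_client_command` is a hypothetical helper, the paths and address in the example are placeholders, and the flags are exactly the ones shown in the command above. The resulting list can be passed to `subprocess.run(cmd, check=True)`.

```python
import shlex
from typing import List


def build_client_command(input_path: str, output_path: str, host_ip: str, port: int) -> List[str]:
    """Assemble the `mineru` client invocation as an argument list."""
    return [
        "mineru",
        "-p", input_path,
        "-o", output_path,
        "-b", "vlm-sglang-client",
        "-u", f"http://{host_ip}:{port}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace them with your own paths and server address.
    cmd = build_client_command("input.pdf", "output", "127.0.0.1", 30000)
    print(shlex.join(cmd))
```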
---
\ No newline at end of file
# Online Demo
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMTM0IiBoZWlnaHQ9IjEzNCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj48cGF0aCBkPSJtMTIyLDljMCw1LTQsOS05LDlzLTktNC05LTksNC05LDktOSw5LDQsOSw5eiIgZmlsbD0idXJsKCNhKSIvPjxwYXRoIGQ9Im0xMjIsOWMwLDUtNCw5LTksOXMtOS00LTktOSw0LTksOS05LDksNCw5LDl6IiBmaWxsPSIjMDEwMTAxIi8+PHBhdGggZD0ibTkxLDE4YzAsNS00LDktOSw5cy05LTQtOS05LDQtOSw5LTksOSw0LDksOXoiIGZpbGw9InVybCgjYikiLz48cGF0aCBkPSJtOTEsMThjMCw1LTQsOS05LDlzLTktNC05LTksNC05LDktOSw5LDQsOSw5eiIgZmlsbD0iIzAxMDEwMSIvPjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJtMzksNjJjMCwxNiw4LDMwLDIwLDM4LDctNiwxMi0xNiwxMi0yNlY0OWMwLTQsMy03LDYtOGw0Ni0xMmM1LTEsMTEsMywxMSw4djMxYzAsMzctMzAsNjYtNjYsNjYtMzcsMC02Ni0zMC02Ni02NlY0NmMwLTQsMy03LDYtOGwyMC02YzUtMSwxMSwzLDExLDh2MjF6bS0yOSw2YzAsMTYsNiwzMCwxNyw0MCwzLDEsNSwxLDgsMSw1LDAsMTAtMSwxNS0zQzM3LDk1LDI5LDc5LDI5LDYyVjQybC0xOSw1djIweiIgZmlsbD0idXJsKCNjKSIvPjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJtMzksNjJjMCwxNiw4LDMwLDIwLDM4LDctNiwxMi0xNiwxMi0yNlY0OWMwLTQsMy03LDYtOGw0Ni0xMmM1LTEsMTEsMywxMSw4djMxYzAsMzctMzAsNjYtNjYsNjYtMzcsMC02Ni0zMC02Ni02NlY0NmMwLTQsMy03LDYtOGwyMC02YzUtMSwxMSwzLDExLDh2MjF6bS0yOSw2YzAsMTYsNiwzMCwxNyw0MCwzLDEsNSwxLDgsMSw1LDAsMTAtMSwxNS0zQzM3LDk1LDI5LDc5LDI5LDYyVjQybC0xOSw1djIweiIgZmlsbD0iIzAxMDEwMSIvPjxkZWZzPjxsaW5lYXJHcmFkaWVudCBpZD0iYSIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJlMmUiLz48L2xpbmVhckdyYWRpZW50PjxsaW5lYXJHcmFkaWVudCBpZD0iYiIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJlMmUiLz48L2xpbmVhckdyYWRpZW50PjxsaW5lYXJHcmFkaWVudCBpZD0iYyIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJ
lMmUiLz48L2xpbmVhckdyYWRpZW50PjwvZGVmcz48L3N2Zz4=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzj
I7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
\ No newline at end of file
# TODO
- [x] Reading order based on the model
- [x] Recognition of `index` and `list` in the main text
- [x] Table recognition
- [x] Heading Classification
- [ ] Code block recognition in the main text
- [ ] [Chemical formula recognition](../chemical_knowledge_introduction/introduction.pdf)
- [ ] Geometric shape recognition
\ No newline at end of file
# API Calls or Visual Invocation
1. Directly invoke using Python API: [Python Invocation Example](https://github.com/opendatalab/MinerU/blob/master/demo/demo.py)
2. Invoke using FastAPI:
```bash
mineru-api --host 127.0.0.1 --port 8000
```
Visit http://127.0.0.1:8000/docs in your browser to view the API documentation.
3. Use Gradio WebUI or Gradio API:
```bash
# Using pipeline/vlm-transformers/vlm-sglang-client backend
mineru-gradio --server-name 127.0.0.1 --server-port 7860
# Or using vlm-sglang-engine/pipeline backend
mineru-gradio --server-name 127.0.0.1 --server-port 7860 --enable-sglang-engine true
```
Access http://127.0.0.1:7860 in your browser to use the Gradio WebUI, or visit http://127.0.0.1:7860/?view=api to use the Gradio API.
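To check from a script that one of these services is up, a plain HTTP request against the URLs mentioned above is usually enough. The sketch below uses only the standard library; the helper names are made up for illustration, and the default hosts and ports mirror the example commands rather than any fixed MinerU setting.

```python
import urllib.error
import urllib.request


def api_docs_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """URL of the interactive API documentation served by `mineru-api`."""
    return f"http://{host}:{port}/docs"


def gradio_api_url(host: str = "127.0.0.1", port: int = 7860) -> str:
    """URL of the Gradio API view served by `mineru-gradio`."""
    return f"http://{host}:{port}/?view=api"


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    for url in (api_docs_url(), gradio_api_url()):
        print(url, "reachable" if is_reachable(url) else "not reachable")
```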
> [!TIP]
> - Below are some suggestions and notes for using the sglang acceleration mode:
> - The sglang acceleration mode currently supports operation on Turing architecture GPUs with a minimum of 8GB VRAM, but you may encounter VRAM shortages on GPUs with less than 24GB VRAM. You can optimize VRAM usage with the following parameters:
> - If running on a single GPU and encountering VRAM shortage, reduce the KV cache size by setting `--mem-fraction-static 0.5`. If VRAM issues persist, try lowering it further to `0.4` or below.
> - If you have more than one GPU, you can expand available VRAM using tensor parallelism (TP) mode: `--tp-size 2`
> - If you are already successfully using sglang to accelerate VLM inference but wish to further improve inference speed, consider the following parameters:
> - If using multiple GPUs, increase throughput using sglang's multi-GPU parallel mode: `--dp-size 2`
> - You can also enable `torch.compile` to accelerate inference speed by about 15%: `--enable-torch-compile`
> - For more information on using sglang parameters, please refer to the [sglang official documentation](https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands)
> - All sglang-supported parameters can be passed to MinerU via command-line arguments, including those used with the following commands: `mineru`, `mineru-sglang-server`, `mineru-gradio`, `mineru-api`
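
The tuning flags from the tip above can be collected in one place when commands are generated from a script. This is a hedged sketch: `sglang_args` is a hypothetical helper, it covers only the four flags named in the tip, and it does not validate values against what sglang accepts.

```python
from typing import List, Optional


def sglang_args(
    mem_fraction_static: Optional[float] = None,
    tp_size: Optional[int] = None,
    dp_size: Optional[int] = None,
    enable_torch_compile: bool = False,
) -> List[str]:
    """Build the extra command-line arguments for the options that were set."""
    args: List[str] = []
    if mem_fraction_static is not None:
        # Smaller values shrink the KV cache and reduce VRAM usage.
        args += ["--mem-fraction-static", str(mem_fraction_static)]
    if tp_size is not None:
        # Tensor parallelism: spread one model across several GPUs.
        args += ["--tp-size", str(tp_size)]
    if dp_size is not None:
        # Data parallelism: raise throughput on multiple GPUs.
        args += ["--dp-size", str(dp_size)]
    if enable_torch_compile:
        args.append("--enable-torch-compile")
    return args


if __name__ == "__main__":
    cmd = ["mineru-sglang-server", "--port", "30000", *sglang_args(dp_size=2)]
    print(" ".join(cmd))
```

The returned list can be appended to any of the commands listed in the tip, since they all accept sglang parameters.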
> [!TIP]
> - In all cases, you can restrict which GPUs are visible by setting the `CUDA_VISIBLE_DEVICES` environment variable at the start of the command line. For example:
> ```bash
> CUDA_VISIBLE_DEVICES=1 mineru -p <input_path> -o <output_path>
> ```
> - This method works for all command-line calls, including `mineru`, `mineru-sglang-server`, `mineru-gradio`, and `mineru-api`, and applies to both `pipeline` and `vlm` backends.
> - Below are some common `CUDA_VISIBLE_DEVICES` settings:
> ```bash
> CUDA_VISIBLE_DEVICES=1      # Only device 1 is visible
> CUDA_VISIBLE_DEVICES=0,1    # Devices 0 and 1 are visible
> CUDA_VISIBLE_DEVICES="0,1"  # Same as above; the quotation marks are optional
> CUDA_VISIBLE_DEVICES=0,2,3  # Devices 0, 2, and 3 are visible; device 1 is masked
> CUDA_VISIBLE_DEVICES=""     # No GPU is visible
> ```
> - Below are some possible use cases:
> - If you have multiple GPUs and need to specify GPU 0 and GPU 1 to launch 'sglang-server' in multi-GPU mode, you can use the following command:
> ```bash
> CUDA_VISIBLE_DEVICES=0,1 mineru-sglang-server --port 30000 --dp-size 2
> ```
> - If you have multiple GPUs and need to launch two `fastapi` services on GPU 0 and GPU 1 respectively, listening on different ports, you can use the following commands:
> ```bash
> # In terminal 1
> CUDA_VISIBLE_DEVICES=0 mineru-api --host 127.0.0.1 --port 8000
> # In terminal 2
> CUDA_VISIBLE_DEVICES=1 mineru-api --host 127.0.0.1 --port 8001
> ```
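
The one-service-per-GPU pattern above can be generalized. The sketch below is only an illustration: `gpu_api_commands` is a hypothetical helper, not part of MinerU, and it prints the launch commands rather than running them, assuming consecutive ports starting from a base port that you choose.

```bash
# Hypothetical helper (not part of MinerU): print one `mineru-api` launch
# command per GPU id, assigning consecutive ports from a base port.
gpu_api_commands() {
  local base_port="$1"
  shift
  local i=0
  local gpu
  for gpu in "$@"; do
    echo "CUDA_VISIBLE_DEVICES=${gpu} mineru-api --host 127.0.0.1 --port $((base_port + i))"
    i=$((i + 1))
  done
}

# Dry run: review the printed commands, then run each in its own terminal.
gpu_api_commands 8000 0 1
```
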
---
# Extending MinerU Functionality Through Configuration Files
- MinerU is designed to work out-of-the-box, but also supports extending functionality through configuration files. You can create a `mineru.json` file in your home directory and add custom configurations.
- The `mineru.json` file will be automatically generated when you use the built-in model download command `mineru-models-download`. Alternatively, you can create it by copying the [configuration template file](../../mineru.template.json) to your home directory and renaming it to `mineru.json`.
- Below are some available configuration options:
  - `latex-delimiter-config`: Configures the LaTeX formula delimiters. It defaults to the `$` symbol and can be changed to other symbols or strings as needed.
  - `llm-aided-config`: Configures the parameters for LLM-assisted heading level detection, and is compatible with any LLM that supports the OpenAI protocol. It defaults to Alibaba Cloud Qwen's `qwen2.5-32b-instruct` model. You need to configure an API key yourself and set `enable` to `true` to activate this feature.
  - `models-dir`: Specifies the local model storage directories. Please specify separate model directories for the `pipeline` and `vlm` backends. After specifying these directories, you can use local models by setting the environment variable `export MINERU_MODEL_SOURCE=local`.
---
# Using MinerU
## Command Line Usage
### Basic Usage
The simplest command line invocation is:
```bash
mineru -p <input_path> -o <output_path>
```
- `<input_path>`: Local PDF/Image file or directory (supports pdf/png/jpg/jpeg/webp/gif)
- `<output_path>`: Output directory
### View Help Information
Get all available parameter descriptions:
```bash
mineru --help
```
### Parameter Details
```text
Usage: mineru [OPTIONS]

Options:
  -v, --version                   Show version and exit
  -p, --path PATH                 Input file path or directory (required)
  -o, --output PATH               Output directory (required)
  -m, --method [auto|txt|ocr]     Parsing method: auto (default), txt, ocr (pipeline backend only)
  -b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
                                  Parsing backend (default: pipeline)
  -l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
                                  Specify document language (improves OCR accuracy, pipeline backend only)
  -u, --url TEXT                  Service address when using sglang-client
  -s, --start INTEGER             Starting page number (0-based)
  -e, --end INTEGER               Ending page number (0-based)
  -f, --formula BOOLEAN           Enable formula parsing (default: on)
  -t, --table BOOLEAN             Enable table parsing (default: on)
  -d, --device TEXT               Inference device (e.g., cpu/cuda/cuda:0/npu/mps, pipeline backend only)
  --vram INTEGER                  Maximum GPU VRAM usage per process (GB) (pipeline backend only)
  --source [huggingface|modelscope|local]
                                  Model source, default: huggingface
  --help                          Show help information
```
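Because `-s` and `-e` take 0-based page numbers, a page range read off a PDF viewer (which usually counts from 1) has to be shifted by one. The sketch below shows that conversion. `page_flags` is a hypothetical helper, not part of MinerU, and it assumes the end page is inclusive, which the help text above does not state. The last line prints a command for review instead of running it.

```bash
# Hypothetical helper (not part of MinerU): convert a 1-based, inclusive page
# range into the 0-based values that `-s` and `-e` expect.
# Assumption: `-e` is inclusive; the help text does not say either way.
page_flags() {
  local first="$1"
  local last="$2"
  echo "-s $((first - 1)) -e $((last - 1))"
}

# Dry run for pages 1 through 10 of a scanned English document.
echo "mineru -p <input_path> -o <output_path> -m ocr -l en $(page_flags 1 10)"
```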
---
## Model Source Configuration
MinerU automatically downloads required models from HuggingFace on first run. If HuggingFace is inaccessible, you can switch model sources:
### Switch to ModelScope Source
```bash
mineru -p <input_path> -o <output_path> --source modelscope
```
Or set the environment variable:
```bash
export MINERU_MODEL_SOURCE=modelscope
mineru -p <input_path> -o <output_path>
```
### Using Local Models
#### 1. Download Models Locally
```bash
mineru-models-download --help
```
Or use the interactive command-line tool to select models:
```bash
mineru-models-download
```
After the download completes, the model paths are displayed in the current terminal and automatically written to `mineru.json` in your home directory.
#### 2. Parse Using Local Models
```bash
mineru -p <input_path> -o <output_path> --source local
```
Or enable it via an environment variable:
```bash
export MINERU_MODEL_SOURCE=local
mineru -p <input_path> -o <output_path>
```
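A wrapper script might choose the model source automatically. The sketch below is one possible approach, not MinerU behavior: `pick_model_source` is a hypothetical helper that returns `local` when a config file exists and otherwise falls back to the documented default, `huggingface`. It assumes the config lives at `$HOME/mineru.json` and that its presence means local models were downloaded, which is not guaranteed.

```bash
# Hypothetical helper (not part of MinerU): choose a value for
# MINERU_MODEL_SOURCE based on whether a config file exists.
# Assumption: an existing mineru.json implies local models are available.
pick_model_source() {
  local config="${1:-$HOME/mineru.json}"
  if [ -f "$config" ]; then
    echo "local"
  else
    echo "huggingface"
  fi
}

MINERU_MODEL_SOURCE="$(pick_model_source)"
export MINERU_MODEL_SOURCE
echo "MINERU_MODEL_SOURCE=${MINERU_MODEL_SOURCE}"
```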
---
## Using sglang to Accelerate VLM Model Inference
### Through the sglang-engine Mode
```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-engine
```
### Through the sglang-server/client Mode
1. Start Server:
```bash
mineru-sglang-server --port 30000
```
2. Use Client in another terminal:
```bash
mineru -p <input_path> -o <output_path> -b vlm-sglang-client -u http://127.0.0.1:30000
```
> [!TIP]
> For more information about output files, please refer to [Output File Documentation](../output_file.md)
---
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px" width="516px" height="516px" viewBox="0 0 516 516" enable-background="new 0 0 516 516" xml:space="preserve"> <image id="image0" width="516" height="516" x="0" y="0"
xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAgQAAAIECAQAAADMC/4dAAAAIGNIUk0AAHomAACAhAAA+gAAAIDo
AAB1MAAA6mAAADqYAAAXcJy6UTwAAAACYktHRAD/h4/MvwAAAAlwSFlzAAAWJQAAFiUBSVIk8AAA
MLJJREFUeNrt3XecXFX9//HXmfSekN4I6SGVhJAAMQkkFAEF4QsCoiBFaYIIyE/Er/IFUbCioIB0
BemKikIoAjZAQGkhodcACSSkk5BkP78/Npvtm5k5955zZ+b93McDdjZ77/mcuzvvvfUcZ4hIpcvF
LkBE4lMQiIiCQEQUBCKCgkBEUBCICAoCEUFBICIoCEQEBYGIoCAQERQEIoKCQERQEIgICgIRQUEg
IigIRAQFgYigIBARFAQigoJARFAQiAgKAhFBQSAiKAhEBAWBiKAgEBEUBCKCgkBEUBCICAoCEUFB
ICIoCEQEBYGIoCAQERQEIoKCQERQEIgICgIRQUEgIigIRAQFgYigIBARFAQigoJARFAQiAgKAhFB
QSAiKAhEBAWBiKAgEBEUBCKCgkBEUBCICAoCEUFBICIoCEQEBYGIoCAQERQEIoKCQERQEIgICgIR
QUEgIigIRAQFgYigIBARFAQigoJARFAQiAgKAhFBQSAiKAhEBAWBiKAgEBGgdaiGXL1XOaoAh8MA
A3IYAIbDUVXne6q/Vvfrhm3+r2thiZrlar7eeInml2Pzko17UFMLdb4vt/lVwyXKXc1PomrLdqiq
s/3qcpu3Ts3PoPY7an6SYeuuqWrT5t+72p9orl/VECYwMTfShtGVbjg2sJx3WWSP8JR7jcW2inrV
V21eY1VK1YbZNsGCQCTjtmEmn2BvNwHqvf3a04XB4A4EVvIP5vEA82MXmzQXKou1R1B+ymiPYKg7
jtnMyHMFr/EEv+auctojwAJ9NPWjcORwm38hqj9zm79a93uqv1b3627Lf1taomY56i1Xd4nml2v4
i1zz77W11P2+HLnNa68sNT+J2u1Qd/vV/cjV+SlT7zuIsOVymz9aVbfe3p3r3nBW4MfK3J/c9Nrf
rfROtgV6fyoIFATFKocgcLNzjxUcAjUfyzk/171mjWkJ8/7UVQOpXF04m3ttWtHLd3Pf4nZ2it2N
JCgIpFINst+479HWbyU2193GZ2N3xZ+CQCrTULvRDkhkTUPc9Xwzdnd86fKhVKJtuZWpia2tvV3g
4HuxO+VDewRSeQZwjSUXAwDYtzk6drd8KAik0nS1K21u4mtt537C7NhdK56CQCqMO419U1lxd3c1
g2L3rlgKAqksu3F6ausezgWlejuJgkAqSXfOo1t6q3dHcnDsLhZHQSAVxB3jZqbcwjl0jt3LYigI
pHL04+zU25joDo/dzWIoCKQEuC3PB3h9nEyvAKUeQ9fY26twCgIpAUaV/0cn2zNIsdP5TOztVTgF
gVQIty+TwzTEvrSP3dtCKQikUuzv+4BRvtw+DInd2UIpCKQyDGB8sLa6MjF2dwulIJAS4fw+xjEm
YLFzS+3GIj19KCXCc+y+bV3A43Y3zdqxLlx7/rRHIJVhVNDWJtA9docLoyCQStCRkUHba822sbtc
GAWBVIL2bBO4xRJ7DlFBIJWgTZqPGjWpR+wuF0YnC6WkFHkyvm3wIOgeuD1PCgJJVjt6sB1TGUM/
BtIBA1ayhNd5jv+yiOVsilBVLvhveom9s0qsXMmwHOPYkX3Y2zX667tlirNXuJeHeYzXi22kyIuI
61jqBgfdGiV18VBBIEk5wB1us12/lr/JhnOiO5H57q92Bw8HrM7YGHh7LArcnicFgfjbx53CbDrm
/f3jGOcOZx4/5clAFX7EksDbZEXg9jzpqoH4GcAv3R3sU0AMVOvFEfyFrwR6Tm8NbwXeLi8Ebs+T
gkB8fNr91U6kQ5FL9+ESbmF0gDqreDnkZuE9lgVtz5uCQIqV40xu830bu/35AwcWsVyhHy+yIdym
sQf5KFxrSVAQSHHauWvsB7TzX5GN5iZ3UsFLFfqxkDcCbp3Ho1wk9aAgkGK0dVdzVGKP2rbjZ5yQ
csUv8VTKLdTaGPSKSCIUBFIw19ZdyRGJrrK1u5TjUi3aeCjV9ddt6t/F3ycRi4JACua+ypGJr7SV
+4nbPc2q7S+8meb667ir1E4VKgikcJ+x76Sy3i78MtUrCK/ZHSmuvdbr3BmknUQpCKQwQ90P6ZTS
use4s1Md4usXQW78/R0LArSSMAWBFCT3LUakuPrPJ3zuob5X7MoU115tiV2eehspUBBIIfa2Q1Nd
fyvOZUCK67+Yd1OtH67jpZRbSIWCQPLX1Z2d2mHBZm44R6e4+lft3DSrtyfs+2muPz0KAsnfHqQ8
lzCAO5XeKa7+am5Lbd2rOYflKdaeIgWB5KuV+1SQ35c+HJbi2jfZd9I6mWdncG+KladKQSD5GuH2
C9OQO6TgZxkLscCOTuXv9hXcmGLVKVMQSL72sD6BWprM1FTX/5idkPRDQXaVncGaVKtOlQYmiaxk
ZsbKBZzsu5Obbn9LtYVbrK27PLn9DruCU/k49e2SIgVBZJ4TeYXT1U0P1pZjV1zKm+Y3vM+lDE9g
TWa/5MzSjgEdGki+ptA5YGsjA0wsfg8H8VfvtSzhZE4ttaFKG1MQSH52CHoUMzDIlGHP2MH2HVYW
vwK73w6yy6gKtl1So0OD6ErkLMGEoK11c0XcX1jEscSHnGd/c99lGm0KXvY9+wFXsSq5Tqd9NNQS
BUF0VhpREHYKL2e9grX1EJ+wg9wpTC3g4OcVu5lLWJxYd6OfK1IQSH76Bm6vS9DWfsfvbH/3Saay
01a+80P+YfdzZ1JjG+QyclyhIMiAktgnCD13YNvgPfyj/ZEBTGAcOzGOoa7e/oG9zfPM5xkW8h/f
KwSOmp957P2AWgqCTCiBKAj9OxvnPfIO7zCPdnSlI11sG1phONa6paxlVTLnA1wGDgQaUxBIfkKP
yhvzetZ63k9nxdmNe10+zIjs/Y1owOMiW1HWxu5w8rIbA9ojkHylPaBHA67o4T+zF6lZDoAaCoLM
yN4vcD2vB/113mCJXZqTfOjQQPLzbNDW3g06L5EoCCRPTwVtbVHpTRHSlFI4KKimIJD8vMSrAVt7
PvjJyYS5EgoBUBBIvj6yB4K1tckejd3dSqMgkHzdHux85juEC50UlNa+QDUFgeTrGZ4P1NK/eS12
Z4uXy/r1n2aqFsnP+/b7MA3Zr2J3tRildlagPgWB5GsTt/JW+s3Y/dwXu6vFURBIZXg2wDy/VVyU
1L51uF/uUo6AagoCKYBdlvatxnYniVwxcHX+m7bSjwEFgRRmgX031fW/w/ms9luFa/TakdYRfGmf
F6hLQSCFucL+kNq6zb6f3h2MSb9lXQrrjEdBIIXZxKmp3WN4C6lfL0jiF7583v61FARSqDftSJam
sN5/2XF+g4Dl/wb1eSuXz+FAXQoCKdw/7fNJR4EtsGP85g4s7u3Z/MBhVmetaZ5nyAYFgRTjHjuC
DxNc30K+wAvFLpzz/Atf+99cnTd8yOsO8SkIpDjz7LO8ndC6/m6H8mTsDlU2BYEU6347hH8nsJ4b
7QCeKW7Rct5ZD0tBIMV71D5lv/Q6wfeenWLHFn+QkcWBwUuTgkB8vM+pdrA9VtzCdovtyaWsL2ZZ
7QkkS0EgfjbxJ2bbUcxnYwFLrbAHbV+O4LnY5Us1jWIs/tbza7uJo9wsZjBsq9/9Ag/YbTxUbGPZ
miqsXLhQm7T+rlz11I8113CN2uEcDIejqs73VH+t7tcN2/xf18ISNcvVfL3xEs0v19TRp9tSn235
vPr7cptfVfqvp8PGuZ1sGuMY4/rU/zeD13iahTzNE7zs0cJWfrI1Pzm3ld+Rpr9ef4mWfne23kLt
OlwTvahZqrae5s54hPmt0h6BJGk+87mOnvSkuw2gJ+0BY5V7j2UsY7Hf/EWlOfZPaVAQSPKWpnIL
cunqTDs60Io2wMdsZB2r2BC7qPoUBFICqnegS043RrA9kxjPALalE22ANaziHZ7jOZ7jpeIPkpKm
IBBJ3vZMZ4oby0TXu8G/dKKT9WOKA1jg/sHdBBoJsmUKAsmwmhNwJaMDu7O3m8YABuVxaX57trcj
3NN2Bb+OffpDQSDiqy3d2JZZfMpNpqO1K2jZjuziprkj7Tz+FjMMFASSSSVyt8BAtmMMuzHdjfSo
t5XNcbPsZ/yAJbE6oiCQzCmBJwjaMZXpbpztwFjXPpE1tnZnsCvHsiBOhxQEIoUYym7s5sYziD7+
K2tgF3eXncw9MbqlIBDZuo50YyJz2MONoJO1Sq2dYe52O5ZbwndQQSAZkrn7BXIMYSRjmcsM1yNI
i53c5fZB+ElgFQSSARk8NdjfJrmdGG/j3LjAtXV3V9qnmR+2uwoCyYQMnSCczF7s6kYwkG6RKhjq
rrJ9WB6ySQXB1uTQg4Upysy+QFe6M4N92N11p3PsYtiZs/hmyAYVBE1pzQAGMpBt6U87oIpVLOEN
3ubdtOf+qxwZ2QfozBBGMI05bueMRBIA7lS7h7+Fa09BUF9fdmac24FRbO/aNvrX5bzonuI/zOP1
2IXGkNxfb1dn/IBoxrCDm8Bkm+QGZCkCNuvkLrQ5rAvVnAYmqf53xy5uLtPYjmF03GpnXnD/sGv5
Z6UNTFK7BZt+47gtW7zh3/vaV2kPG5JHC13Y1e3FeEawbab/EG7i83ZzqIiq7CBo5bpU9WfX3F72
CboWeGS4xt1bdS7PVGoQVPe4vuaDgPzfpqQUBK1dl6oh7JGbbdPpTIfY2zIv8+wA1muEojT1YQjb
MYdZbmyRmdvJDnS72I+4JtEZf0pK3UG76kd93aG3crEjsjfbMZrZzHSjM3gI0JIp7Mi/wjRVaUHQ
lslMdBOZbJNdR+9fi37uR7Yn/4+nY3crrvoj7jlaHoMvmLaMZ7KbwI5MSeBnHUNvZioIkrYdc9nJ
7cAQ+iW5Wrc3YzmTW2N3T+roz2zmuvE2mIGxS/G0J5ezIkRD5R4EHejFJHZzMxlu3VLq7WC72rXi
pthdrXjt6cRY9mOu246utPVfYXxudxuoIPDRnmEMYmc+4XalU+o7hZ3tMmfcHLvTFWsIQ5jAXuzS
aGCwUpdje54P0VD5BcEwpjCWiezCgIDTYnXjF7Y+G6PPVZDu7MhYprAjE8p2CrRPcEeIZsonCDoz
jbluKtsxlDbhfy1sG/dLW8DC2JuhQkxkhpvFKBvmupfkacD8DQrTTKkHQRs6MZB9meMm0tXi3iPe
z11te/BR7E1SxjrShV3Yh5muH11Jb1SADHEjwrRTukEwkIFMYjaz3aDM/E3Y1Z1l/xe7iDLUiQGM
ZBqzmekq4u1fR/cwzZReEHRhMmPYkR2Z5LJX/an8gadiF1FGBrET45jKRDc0M3EfVqBOZ++t1JxW
jGOGm872NpyeGXlyrbFt3LF2SuwiykAbZjLD7cpgG+naVmgEBJX9IOhED2Yw182gn3UrgePCw7mq
0u80LFqOjgxiFvu58fSmS+xyKkl2g6Ar/RnFLnySca6Ubg7pySEKgoL1ZCCj2J1PuImxS6lMWQyC
UUxhe6YyyZXmDaIH8zPej11EyZjAFEaxK1NcVx0CxJOtIJjDvm4SQ2xk9EErPLjRNoG/xq4i8/qx
Gzu6KQxnSCn/tMtFNoKgNWP4AgfRt0yOC/dTEDSjDR2ZxAz2cqNtGwqbJVBSlIWBSSbb6e7wEjgN
mL+Ftn3sEtLQcGASmhk2pMmv92ewjeWTbnayT3+WvdeqhoVoJuoegeH6cbI7gV4xq0jBAAbzVuwi
MqIH2zOJSezI1LJ9HqAMRA0Ct6ddyJQyPDpsxyQFASOZ66YxzrajjyIg62IGwZf4cZmcE2ioTckP
iFG89nRkFru7XdnWepbVAV9ZixUEbflf963YnU9Njp6xSwiuPX3Zjl3Yhyl0LsO9vDIXJwjacJE7
LXbXU1VuZz1aMpjxjGU6OzBShwClKkoQuJPstNgdT1nGJvVNyVTmuClsbyNdB90JUNpiBMHnuCh2
t8VTJw7naEZb5R0ClangQZAbZz/WjSQlrQ97urNtXOwyJEmhg6A1l+mGkpJ2mDvNpuswoNwEDgJ3
uM2M3WUpWncudMeZLgmWobBB0I9SuGRorOc1u5v+7jB0GrzWRLuSabGLkHQEDQJ3NKNid7hFS1ls
85nHo24hm/gKh8cuKENmulvoH7sISUvIIOjNIbG724zVvMjT9ixP8F9Wb/mqdoFrzXI3KQbKWcgg
mMv42N1tyF7kQf7LAl5hUb2vxy4sW2a4W+kbuwhJU7ggaMOetInd3c02ssr+xr08ynu8VyE3/xQt
tw0XmGKgzAULAtebObE7yxqW2Os8wr08ysf6s5+nn9rs2CVI2oIFgW3rtovYz5dZwEv2Dx7jnYhV
lKIj7MjYJUj6wh0azIrRPVvKozzunraFLNSVwCJs674fuwQJIVwQjA3YK2OtvcA8HnXzbQmrArZc
ZtwJDI5dg4QQ7hzByACNGO/ztv2Hh/knb7ExVN/K1gg+G7sECSPcHkHvVNe+jOd41R7nMZ4M1qPy
9xmGxy5BwggXBAPSWKnBk/yLp3iBZ3QIkCzXjU/FrkFCCRcEyba0yd7jn9zDkyxmcbA+VBSb7GbE
rkFCCRcEyVy1X81i3rC/8CALWK9bgVLk2C0j099IAKXzo17ISzxnj/CQ0yEApP9YZFs3R3dcVY5S
CIJl9nOe4Vle1p0AtVJ/k3Zlcuw+SjilEATvcRHrYhdRccbQOXYJEk4udgF50a9keKlc5ZGsKo0g
0DFBeCFuAJPMKI0g2EwnrwLqEbsACamkgkACysrYERJECQaB9guC0GauKCUYBCKSNAWBiCgIRERB
ICIoCEQEBYGIoCAQERQEIoKCQESoqCDQrXIizamgIBCR5igIRERBICIVEgQ6OyDSsooIAhFpWZkH
gfYFRPJR5kEgIvlQEIiIgkBEFAQigoJARFAQiAgKAhFBQSAiKAhEBAWBiKAgEBEUBCKCgkBEUBCI
CAoCEUFBICIoCEQEBYGIoCAQERQEIoKCQERQEIgICgIRQUEgIigIWqJtIxVDv+zNW+WxrKZYkpLS
OnYBGbYcwxW3qGufdnFKGkmS9gia57NtuscuXqQQCoLmFbk3IFJ6FATN2+CxbJvYxYsUQucImreM
KloVuWx3HcNLKdEeQfM+9li2Y+ziRQqhIGjeEqqKXlZ7WlJSFATN8ztH0C52+SL5UxA070OPi/Xd
GBi7fJH8lf0urMc1wI9ZzTZFLtuatrF7LpI/7RG0ZEXRS3ahd+ziRfKnIGjeJpYUvWx7usUuXyR/
ZX9o4HE9v4rFRR9YtHPFHlSk3y+RRrRH0LxNvFf0sm3oE7t8kfyV/R6Blw89lu2tv9lSOrRH0JJ3
PJbVHoGUEAVBS171WLaPbjOW0qEgaEnx5wigDz1ily+Sr3BB4NOSx+G28/lY7fHgUT8FgZSOcCcL
1xR9r50r+mFgX+tYwqAilx3geqVbnE5GVoRAw+OEC4JlRf+FbEdXFhfbrNfbZSWvuGKDIGcDfJoW
AWBdmGZK4dAgF22PYI3XdYOxkaqWMmI+l7ALEC4Iiv/THC8I1vOSx9JD9OCReAt0BBguCIpPtg4R
xwT22SMYpwuI4m1TmGbCnSMo/vx7K7+hQL3Otrzv0e721pPlPo2LsCxMM8H2COyDohdtS2evln0+
3vX4QbRn2/S2p1QIn3GyChDu0KD4s5+eewReFvGWx9KTotUt5cJjn7QQ4YJgZdFLdog4yMf7HmMS
wIxodUu5WB2mmXDnCJYXv6jr5NOw12nXj3jTo+4drTUbfZqXirc8TDPhgqD4Yb+gl981FK/ThS96
LNuX4bzgVbpUuqVhmimFy4f4nSz0PF34ImuLbrgtU9LZmFIxiv/tK0i4IFjksWzXYFU29oxHhLV2
c70eetrKh1SAd8M0Ey4Ilnss2xevswReXi3+OQdggnX02h9p8UMqwPIwzYQLAp9jnW5+QeD5d3eh
R9MDGZHO5pSK8LELdNUgWBC49awveuEedPFp2/Mv79MeTQ9wO6e1RaX82RIru3ME6zxujdjGLwg8
Pehxv7fTTUXiYZnXnNwFCHeL8Wor/h69vm6biKfVnvW64jFZcx5J0d4vv/EI1vvcLGmdI55WW2+P
eyy9A6NS2Z5p07nILCjLIPA5XVjsOEFJMB71WLoD20esvXia8SIL3im/INjA6x5Lj4562fwBr6UP
pENahaV4l0Lf4FtZGvN50qUgIYcz99kj2Jb2AStt6A3zuK3DzSzJyU6qYhcg4BaF+vsX7vIhzifd
BkYNgg/5t8fSXZgdsfZipbYXI3n72JaGOlUTco/gveIfPHKjI95bCGv4h9fy+6ZVWIq/Jj3TW7Xk
abXXo3oFCXf5EFvh0a2u9AtVaZOe9rgdCje3BCc7ibu9BeCdUE8ahN4j8OnWmICVNjaf5z2W7uE+
nVZhKT3H0CbayNFSa2moh5BDnyz0GRN4h4CVNvau17gCrTg0avWFGxD1nIxUe5dVoZoKGQSbvI54
RgestDHjEa8D8h0YH7X+QnXU9LgZsCjcbV1hf9yvFL+o2zFopY3Y/V5zIw9wn4xbf4GG6KpBfB43
5Rcs5OVDeNXjEYquDA9Va5Oe5zWv5fdIZ3iVlG4nGqRDg+jWeo2gXaCQVw2wVz0ODtoyIVStzfTg
Qa/FZ6VzcJDSvuMA3WIc3VL+G66xsIcGL3gEQRumBq21sTu9RiTuQEoHB6mMfTQsyhaWuj4KNcsR
hA6CDz1mN8DtELTWxp70GtEYDiihuwn0pEF8r4e7nSh0EMBTHssOjTqIKZjd6LO4m5jeHYYJ6xxx
2lnZzGtkrIIFDgJ7zOOgtk/kewngz36Luy9Frj9fg0vyManyUuX1R7NgofcIXvZ4qq2ni/1k/6vm
98zBdHZKvqgUrhlsqxuMo9tEOe8R8ARril7WMTFwtQ2t4h6v5du7rydfVAonCwfSMdIWlhqrw86R
FfQ+Aodb7XXCbZTvnEfeHvIavxBmRX5mIj9jYxcgPOY2hZzGJux9BFiV/ctjJRPcyMhzAz3uNWwZ
9HGHJbtdU9CRkbFLEHvULOQ0NuHvKH/CY9m+tl3kuYE+5m6v5R2HsF2i25PEzxL0d7EPwQSeCdtc
+CB4zmsQrBlRxy4E7F7PGz/HcmDiNSX70S/5qJICbeTZsA2GD4J3eMlj6V0i30sAL3geHMBXMn5O
flzsAsSeCTXnYY3wQbDcPO6gdjsyIHjFDdj1Xrca44bx+WQrSvTAoI3bPebWFQAe8TwpXbDwQbCe
Jz2Wbsf04BU3dLfXA8mAO51tkiwo0QODDhnYwvKsxzR7RYkx/MSzXn9R94xQcX1VdqnnGvpzcuxO
NGs0Q2OXUPHWez7VUoQYQfC6z1Qnbm7UCVGr3egzfRuA+yL9kysnyUODtJ6RlAIs8DqPVpQ4QeAz
EGjvDOwTvGt3eK5hqDs+uXKSPDRw+0XdsgKwwGt0z6LECIL1LPSp2M2NUHN9m7iJ1V5rcByayYt0
w3UzUQY8E36eqShDVNqjrPVYfGYGxtP7D494rmGM+2zsTjRhH7rFLqHiLTefebWKFGes2n95nXcf
zh5Rqq5rNX/wu4gI7pTsPXfg5mg+g+gW81j4RuMEwWKvswQd3W5Rqq7HbvC998sGuXNi96KB0UyJ
XYLYcx5P6BYt0uj19nevxacnex2+KCvsCu91HM7OsbtRzyy2jV1CxaviTzGajTWNxd2s81h6avSx
igBu9T6328r9JJnzHYlcOmzrZsV+kkNYy8Mxmo0VBC96PbrTjtkZmInnQ7vKex07cVzsbmwx3JXK
mIrl7Fmfu2yKF+vttN7u91ncHZCJ4TVv9J6ksrX7ehLDgCRyD8FUi3/AVfHsrjjtxvu7eoPX0hMj
z3tU7UW72nsdg/ly7G4A0D53ROwSBPD6A1m8eEHwktf91I6jolVe1/Us8V2FOzkD90rCeNs7dgli
z/FqnJbjBcEK+5vP4m6vDFw5gOfxmusAgNa5izJwoPO52AUIcJf3wWaR4gXBx547QSP5dLTa67Cr
WOy9jsnu/MjdGOy+ELkCgY38M9gghQ3EPPf+BC/7LO4SHt6jSM9zWwJrOS7uGXv3RXrFbF8A+G+M
ewqrxQyC1z3ncpkUfZ4DAOw7vOu9kvacH/Gt2IEDorUttZ7xfby9eDGDYBP38rHH8r0zsk+wzH6c
wFqmuG/E6oD7XDYitcKt5b54jUe9Lcce8Bz0aw49Y9a/xVXmM/zaZu7Lcc56uN7uGNrEaFnqedM8
59b0Eff+vNc8R2+fRDYuea0ggX0C6+J+EWOYMNvVsvXEQ4Wyf3qOceElbhCY3eh1lrQ1h9M+ag9q
/IW/JrCWwe7i4I8Bt8sdm4HbtaWKm2I2H/tX4E9+04W4mYyO3INqK/hRInm+H18LW7ibY3uFbVGa
9Hq8KwYQPwjWeN5b3c0FfuM0x+4miSO8Vu7coPcZtnWn0C5ge9IMuyXmgQG4UPcvNPt86y7OZ1pU
+MhGhB/qsUkj3OOJ3CH4ktur6vXCFskVObOjO5YrNCZRBpiNb26wnjDv0Nh7BLDA84RhB7JyT9zL
9r1E1jOSX9ApSMXtOEkxkAX2cJyHj2vFD4Llvk/1u+MychERfuU9pCkAti/nhijXHcWEEO3IVt3h
NZxvAuIHATzseWfesCTnCPCywr7tO6RpNXeqOzL1aofyTd0/kAkv85fYJWQhCJ7jH559ODwzY+09
ZNcmsp62/CI3d2s/HEfOZ2yx4xkSbsNIC/4e6+HjWlkIgirvocHHk5X5eTbyw4TmrevMxWyfYqUz
3AlhNolsxWq7PXYJ2QgC7M+84LmKIwKdXtu6l7ggmRXZeK5jYEpVutwFmsokI17godglZCQIWG5/
8rtK4mZkZp8Au9HuTGhN09wlhVzlz/cwweFOsdmht4s0aaP9OvaJQsjCfQTV+riX6OrVwLM2JZkT
dQkY7B5LbLbjn9rpzf2Tw22+f6D6PgKXZ5q6CcxLcjZm8bDSxrR8srxS7iOotoQHPNcwgewc875l
30psGsuvcVHC1bXiHMVAVtgNCYxmkYCsBAH2Q68pTwB3embuJ4Cb7dakVuXO4ltJlpY7mkPDbxBp
0nouiV1CtcwEAU96ToMGQ9yJsTuxxVrOTujqAeDOd6fWzkjkabydG3XLSB12X/wLh9WyEwQf81vv
vmTnfgJ43c5lQ2Jr+6E7NJEg6O5+kdqVCCncVV5jdCUoO0EA9/C45xrGckjsTtRxk3e01Wprv8pn
h77lH6cjdwazYm8W2WKe95mxxGQpCN7jJt9TbLmT3IjY3ahlX0/m2QMAutplHOy5jgMtIw9tC7CB
yI8e15WlIMBu5jXPNQyzs2L3oo737XRLbsKKHnal1x7PeH6UmduuBF6LNc9hUzIVBLxrf/RdhTua
KbG7Ucej/G+Ca+tu13BYMQvmyHVxlzAs9uaQWnZbvMHLG8tWEMCVrPRcQ2v3I1rH7kYdl/GbBNfW
2a52JxexXGsuYLfYm0LqWMLFsUuoK2tBsCCBIRxncmzsbtRlp5HAYOdbdORSvl3wz+0cOyX2dpC6
7Od8ELuGurJyi3GtUe4x7wG/XrBPxh7xpZ4p7r6Ep2z9iZ2JNb7F2DV5ttXtz810iL0RpJa9xbR8
5/SorFuMa72YwBP9ozN0uzHAf+yrCf88T3cX5z2Q+z5cpxjImJ97Tu2TuOwFAVzjv9PkTkvqern3
nXzVbrCk5zs+ld8woKl/yNX/GOOuoUfCbYufF8nACAT1ZTEInkvgPEE7zkmqbwnc1gvwY36fTD1b
6jrY/ZFRW/mm/lxLv2TbFW+/zdSBK5DNIMCu8r+w4vbyHcnQNfN5kVbaVzwHbm/EdrTbmdPCNwyw
GzSdWea8ZlfHLqGxTAYBz/C7BNZyLpN9Fk/ooKDWO3Y8ixJe5wRuc19u5t/62VUtxoREYTfyduwa
GsveVYNqfdyz9PFu9I92EJuKXThHVZ26E9pOe7vb6JLMqrbYZD/NfcfW1lw12JztvbjOMjNqk2zx
tk0u7BxYpV41qLbEEphfmP0p+vDAFfj1PM2zk3zHXWiklTuTmxrcNdiL6xUDWWQXZOv+gRpZ3SOA
Tjzi/KffeMf2YEFx9bom9wgS2EP4X3eed78aW2hfc/ds3iPob9dmZMJ4qcce5EBWFLhMkMqyukcA
a/hRAmsZ4C5MZxIPj2sJP7A0RqUZ425w59Ia2NZuUAxk1A8KjYFQsrtHAD240yVwN4B9lZ8XU2/L
ewRu81eL2n6t+ZU72r9nTVT9a7vNfcNmpLFu8WXz+B/WFLxUkNqyHARwgLs57/vnmrfEPs2/C683
nyCo+ZeCt2KX3G/sAO+eSSlZYwcUMxBJpR8aAPwxkTnh+riLXe/0iizqIGGVfSXRR5Ek+27IznhE
jWU7CMy+7f1YMsAuaY/MU/gchPa27e85IbyUkvfsh7FLaEm2gwDmJ3IZEfd1Ppt2qQXvGbxjh7Iw
7aokG+wSXoldQ0uyfY4AoKd7gu0SKOAtt2dVXjMs1txIlO85gvr/XuAx3Vh3e6pTnUo2PGvT+ai4
RXWOoNpSOyuRWYMGc2UCJx6T9rwdzPOxi5CUrbZvFBsDoWQ/COD3ycwaZDNdQvMU5yfPfaDn7UDm
h6xLQrObuDt2DVtTCkGwkR8kND/c6XzZfyX5y/OswYvaKyhrb3BxoP17D6UQBPBfrkiou+e4aS1/
R9LPHLp81rrQDua5hBuWrLikFGI++ycLq3Vyf2Z2EnXYc3yypYeBa08S+p4srLtkbuunOUa6G5i2
tW+SknOP7eO3Ap0srGuNnVn8A8V1ufHuEtqF78BW9wxe4iCX2AzKkhGr7dTYJeSnVIIAnrDvJ7Sm
A90vkh91JD8tnjVYxFF2ZZy6JB12Pi/FriE/pRMEcClPJLSmY3KnREoCoIX7ENdxkv0kYmGSKLsv
qXNb6SulIFhsZ7E8kTU5/tcOjN2dJic538g3OS8rU2WLl2V8J6sPHTdWSkEAD5LQsI/Wy13JJ2J3
B4wcDc4crLfv2GezNCueFMXsxwnOhZ26UrlqUKOju49dEyrpRduPlxvXmc5Vg7ozEtXMS1QdBFXk
MKg/S9FcdyN9E+qnxPA324v1SaxIVw2astZOSWxO+VHuOvrH7lB9W34cD9hu3M2G2PVIkZbbkcnE
QCilFgTwH0tuxL8Z7lZ6xu5QQw5wuNfc/QqCUmXf4I3YNRSm1A4NADpyi/tUYmu7277A0rp1xjw0
yFGFw/q649mLaemMtihps5s5KrkTvhqqrHlj3J8YkVhll3NyzeF5VewgaFc1xp3AfgxOcnNJUE2e
eyqegqCltR3F1bRKbG2XcEb1bnjUIBhTNT13hM2mbaKbSsLaYAfy5yRXGOYd2jpIK4mz6914zkxs
bae4FXw74hNiw5nCnhzkemb+ITXZmvOSjYFQSnSPAOjEn10ijyEBsNH9H9+NsEcwIjfLdmcqY5Le
PBKDPcTuia8zSOWlGwQwzt2X4OW/Kvd/XFC1KUgQtLJ2Nsrtxz4Mp1ep7pVJQzafzyR5dmDzWoPU
XspBAIe53yT5NnK/rfq9e8ktrlrGx6kEQcdcL+ttE3K720yGlOClW2nJGjuMu5JfrYIgn7VexFkJ
r7LK/bdqIQvdy+6tqrdY4jZPWeoRBJ3omxtio2wg43PjbFQqG0Li+2Ziz8fWoyDIR6fcjanNGPQB
7/IBS3mVxbxpb/MeK1mWRxC0ogP9GOQGsQ0j6U0/ejOQbilVKZlg13NqInNwNF5zkPpLPQhgsLuH
samX/xEfsY4NrONDVvE+uOWsrHMB06wtPehOd7rTiVa0pwMddSGwYsy3XdOJAQVB/nZ28+gaqBsi
jb1pB/N4WivXQ0f5etROLq0HPKSsrLcT04uBUMohCOC3dmEik6CIFGqTncu9sYvwVw6HBgBt3HV8
LlBXRLawCzkn3T9COkdQmK7uLmYG6oxItXn2P6xJtwmdIyjMSjs8scFNRfLxTzsu7RgIpXz2CAB2
cHcxMFCHpNI9Z3vzTvrNaI+gcE/ZiSyLXYRUhGV2RogYCKW8ggD+ZCdlfQJqKQOr7aRyuFZQq9yC
AG6xU9za2EVIWVtlX+KW2EUkq/yCAK62i2KXIGVsg53FzbGLSFo5BgF2YelMNSUl5+dcHruE5JXX
VYNabdyVHBW2SakIl9opYRvUDUV+2rlfcWToRqXM/dqOZ13YJnX50M96O96ujV2ElBO7NnwMhFK+
QQDrOJHrYhchZeM6TizXGCjnQ4Nqbdyv+GKcpqWsXGdfjjMFnQ4NkrDBTkAHCOLrWjuhvGeiLPcg
gPV2ol0duwgpZXa1nVjuQ9+UfxDAek7mmthFSMm6hgoYAasSggDW2/E6QJCi/NxOKP8YqJQggI12
ApfFLkJKzDousq+W97mBGuV+1aC+8903NMGY5GmdnZaFW9V1Z2EajnEX0yV2EVICltlXuCl2EaAg
SKkKO8RdpVkQZCs+sBO5PXYR1XQfQTpusy/yWuwiJNNetP/JSgyEUnl7BAATuMLtErsWySZ7mC/z
Yuwq6tQTpJXKDALo425iTuxqJIP+bF/kg9hF1KVDgzQtsQO4nE2xy5BM2WQ/sUOzFQOhVGoQwGpO
tLP4OHYZkhkr7UzOKJd5CgpVqYcGNZ8f7C5keOyqJANesVP5S+wimqJzBKlUYQ0/n+RuY2TsuiQu
e4STeCp2Fc3UFqSVyj00qPG07W7Xxy5CYrJr+WRWYyAUBQEs4gS7kFWxy5AoVts5HM/K2GXEpkOD
Gp9xP2Pb2PVJWLaQM7J5ZqBOjUFa0R5BjTvZ2+6JXYSEZL9j36zHQCgKgloLOcjOK9/hKaWeNXY2
h+tm8xo6NGjw2h3MRQyLXaekbIF9jXmxi8iPDg3iuN0OsN/GLkLSZFfZ/qUSA6Foj6DBa4dBK77i
vs02sauVFHzIWXYNVbHLyJ9uKEqliryCAGCW+x4zYtcrCbvfznDPhPqdT4aCIJUq8g4C6M5p7lR6
xK5ZEvKu/Zifs8EFemslRUGQShUFBAHAHHeOHlcuC3fZeTwOjX/uWacgSKWKAoMAevBV91W6x65c
PHxo3+bamucKFQRNURA0eN3kr8kMznPaLyhNVXYPZ/NM7RcUBE1REDR43cyvSc6dzkkMjV2/FOg1
fmKXs7HulxQETVEQNHjd3K+Jw8a6MziCdrH7IHm7wn7mFjT8eSoImqIgaPC6hSAA2M+dy9TYvZA8
PGQXcH9TP08FQVMUBA1ebyUIoAdfdOfQM3ZPpAVv24+5pvrRYgVBfhQEDV5vNQgABvEt90UdJGTS
B/YHzueNmpcKgvwoCBq8zisIAHcIh3Ng7P5IA7fbpTxc9wsKgvwoCBq8zjsIsLYczGlup9h9kmr2
CD/kzw3HpVYQ5EdB0OB1AUEA0J3Pu//HoNj9qnjP2neZx4rG/6AgyI+CoMHrAoMAoCsnuyMYF7tv
FesFu45fNjfqoIIgPwqCBq+LCAKAXu5YjmNE7P5VnDe5zX7Koua/QUGQHwVBg9dFBgHAKA7nGKcB
UAOxJVzOrcxv+bsUBPlREDR47REEAH05yH1DoyGn7m27kutrLxI2T0GQHwVBg9eeQQDQhRPdUYyN
3dcyZfzbbueKfOehUBDkR0HQ4HUCQQCwDV9yc9kzdn/LzgP2W37DhvwXUBDkR0HQ4HVCQQDQ1e1s
x7uDYve5PNhabna/t783dYmwJQqC/CgIGrxOMAhwWAc31g5zx2ggVC9v2q3cwHy3sfDfVgVBfhQE
DV4nHATV39efY92+TNHTCQVbzfN2B5dVnxEo5i2sIMiPgqDB61SCAHBt7CC3N/vSN/Y2KBmvcq/d
7f7Y/E8sHwqC/CgIGrxOLQiq/7+jm2wHun1jb4eM22B/cXfZv3i+5Z9YPhQE+VEQNHidchDgsM6M
Z08+54bqUKGRj+0V7uQmXnFr62+3hp/nS0GQHwVBg9cBgqD6H9uxs/syOzGMVrG3SiZ8zMv2MHfw
YPUsRI22FwqCNCkIGrwOFgTVn/fmU25Xdq3o24828DRP2N+5k7W1X1QQ1FAQpFJFxoKg+pPhboJN
Zp/KG9vAHudu/s6Cxo8NKQi2bKMgrSgIGryOEgTVn/VkNLPZ302gY0Y2V1qqWG+Pcjf38RZLm/4W
BUENBUEqVWQ4CKrl3ADbn93dePqW4byLH/KePcZD7q+2qOUZiRUENRQEqVSR+SCo+bwTO7lPMI4x
bF8GVxfWs4CF9jSP8Sgf5bP1FAQ1FASpVFEyQVDz2SA3ysYwlulMzcY2LIStYj7z+Rev8yJvF7L1
FAQ1FASpVFFyQVD9X0cvt42NZmd2YozrQ5uMbNCmVLHRVvFvnuQpnmUZy+tPOZbf1lMQ1FAQpFJF
iQZB7f9ztGUEk5nGaDeILnSjS+ytCsBK1vC+vc18nuZpXmFdSxtIQZA/BUEqVZR8ENT9vrYMZSTD
3HAG051+DKBb0M35Ph+wlEV8aM/zMm/zbL5vMgVB/hQEqVRRVkFQ97PO9KU/3V0v68cgejGYPnRx
iT7iZCt4iw9ZwmLedu/Yct5lKcv4sOl+5PtzaPk7FARhqm0du5uSkNWs5pXNnzva0YkOtKED7Wwo
PWhPT3rTlS70xtGOHrRik+tG5/rrsBXkyLGSFaxlNatZyQes5V1WuEWsZz1rWM86Piqx95JsVbA9
AhHJrlzsAkQkPgWBiCgIRERBICIoCEQEBYGIoCAQERQEIoKCQERQEIgICgIRQUEgIigIRAQFgYig
IBARFAQigoJARFAQiAgKAhFBQSAiKAhEBAWBiKAgEBEUBCKCgkBEgP8PCPTCCyMAfxEAAAAldEVY
dGRhdGU6Y3JlYXRlADIwMjUtMDctMDhUMDI6Mjc6NTgrMDA6MDDf29LGAAAAJXRFWHRkYXRlOm1v
ZGlmeQAyMDI1LTA3LTA4VDAyOjI3OjU4KzAwOjAwroZqegAAACh0RVh0ZGF0ZTp0aW1lc3RhbXAA
MjAyNS0wNy0wOFQwMjoyNzo1OCswMDowMPmTS6UAAAAASUVORK5CYII=" />
</svg>
{
"bucket_info":{
"bucket-name-1":["ak", "sk", "endpoint"],
"bucket-name-2":["ak", "sk", "endpoint"]
},
"latex-delimiter-config": {
"display": {
"left": "$$",
"right": "$$"
},
"inline": {
"left": "$",
"right": "$"
}
},
"llm-aided-config": {
"title_aided": {
"api_key": "your_api_key",
"base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
"model": "qwen2.5-32b-instruct",
"enable": false
}
},
"models-dir": {
"pipeline": "",
"vlm": ""
},
"config_version": "1.3.0"
}
mkdocs
mkdocs-static-i18n
markdown-gfm-admonition
mkdocs-video
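# Frequently Asked Questions
### 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
## 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
The `libgl` library is missing in Ubuntu 22.04 on WSL2. You can install the `libgl` library with the following command to resolve the issue: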
......@@ -11,7 +11,7 @@ sudo apt-get install libgl1-mesa-glx
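Reference: https://github.com/opendatalab/MinerU/issues/388
### 2. Encountered the error `ERROR: Failed building wheel for simsimd` when installing MinerU on CentOS 7 or Ubuntu 18
## 2. Encountered the error `ERROR: Failed building wheel for simsimd` when installing MinerU on CentOS 7 or Ubuntu 18
The newer albumentations release (1.4.21) introduced a dependency on simsimd. Because the prebuilt simsimd packages for Linux require glibc 2.28 or later, some Linux distributions released before 2019 cannot install it normally. You can install MinerU with the following command instead: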
```
......
<div align="center" xmlns="http://www.w3.org/1999/html">
<!-- logo -->
<p align="center">
<img src="../images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>
</div>
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://img.shields.io/pypi/v/mineru)](https://pypi.org/project/mineru/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mineru)](https://pypi.org/project/mineru/)
[![Downloads](https://static.pepy.tech/badge/mineru)](https://pepy.tech/project/mineru)
[![Downloads](https://static.pepy.tech/badge/mineru/month)](https://pepy.tech/project/mineru)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMTM0IiBoZWlnaHQ9IjEzNCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj48cGF0aCBkPSJtMTIyLDljMCw1LTQsOS05LDlzLTktNC05LTksNC05LDktOSw5LDQsOSw5eiIgZmlsbD0idXJsKCNhKSIvPjxwYXRoIGQ9Im0xMjIsOWMwLDUtNCw5LTksOXMtOS00LTktOSw0LTksOS05LDksNCw5LDl6IiBmaWxsPSIjMDEwMTAxIi8+PHBhdGggZD0ibTkxLDE4YzAsNS00LDktOSw5cy05LTQtOS05LDQtOSw5LTksOSw0LDksOXoiIGZpbGw9InVybCgjYikiLz48cGF0aCBkPSJtOTEsMThjMCw1LTQsOS05LDlzLTktNC05LTksNC05LDktOSw5LDQsOSw5eiIgZmlsbD0iIzAxMDEwMSIvPjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJtMzksNjJjMCwxNiw4LDMwLDIwLDM4LDctNiwxMi0xNiwxMi0yNlY0OWMwLTQsMy03LDYtOGw0Ni0xMmM1LTEsMTEsMywxMSw4djMxYzAsMzctMzAsNjYtNjYsNjYtMzcsMC02Ni0zMC02Ni02NlY0NmMwLTQsMy03LDYtOGwyMC02YzUtMSwxMSwzLDExLDh2MjF6bS0yOSw2YzAsMTYsNiwzMCwxNyw0MCwzLDEsNSwxLDgsMSw1LDAsMTAtMSwxNS0zQzM3LDk1LDI5LDc5LDI5LDYyVjQybC0xOSw1djIweiIgZmlsbD0idXJsKCNjKSIvPjxwYXRoIGZpbGwtcnVsZT0iZXZlbm9kZCIgY2xpcC1ydWxlPSJldmVub2RkIiBkPSJtMzksNjJjMCwxNiw4LDMwLDIwLDM4LDctNiwxMi0xNiwxMi0yNlY0OWMwLTQsMy03LDYtOGw0Ni0xMmM1LTEsMTEsMywxMSw4djMxYzAsMzctMzAsNjYtNjYsNjYtMzcsMC02Ni0zMC02Ni02NlY0NmMwLTQsMy03LDYtOGwyMC02YzUtMSwxMSwzLDExLDh2MjF6bS0yOSw2YzAsMTYsNiwzMCwxNyw0MCwzLDEsNSwxLDgsMSw1LDAsMTAtMSwxNS0zQzM3LDk1LDI5LDc5LDI5LDYyVjQybC0xOSw1djIweiIgZmlsbD0iIzAxMDEwMSIvPjxkZWZzPjxsaW5lYXJHcmFkaWVudCBpZD0iYSIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJlMmUiLz48L2xpbmVhckdyYWRpZW50PjxsaW5lYXJHcmFkaWVudCBpZD0iYiIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJlMmUiLz48L2xpbmVhckdyYWRpZW50PjxsaW5lYXJHcmFkaWVudCBpZD0iYyIgeDE9Ijg0IiB5MT0iNDEiIHgyPSI3NSIgeTI9IjEyMCIgZ3JhZGllbnRVbml0cz0idXNlclNwYWNlT25Vc2UiPjxzdG9wIHN0b3AtY29sb3I9IiNmZmYiLz48c3RvcCBvZmZzZXQ9IjEiIHN0b3AtY29sb3I9IiMyZTJ
lMmUiLz48L2xpbmVhckdyYWRpZW50PjwvZGVmcz48L3N2Zz4=&labelColor=white)](https://mineru.net/OpenSourceTools/Extractor?source=github)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAF8AAABYCAMAAACkl9t/AAAAk1BMVEVHcEz/nQv/nQv/nQr/nQv/nQr/nQv/nQv/nQr/wRf/txT/pg7/yRr/rBD/zRz/ngv/oAz/zhz/nwv/txT/ngv/0B3+zBz/nQv/0h7/wxn/vRb/thXkuiT/rxH/pxD/ogzcqyf/nQvTlSz/czCxky7/SjifdjT/Mj3+Mj3wMj15aTnDNz+DSD9RTUBsP0FRO0Q6O0WyIxEIAAAAGHRSTlMADB8zSWF3krDDw8TJ1NbX5efv8ff9/fxKDJ9uAAAGKklEQVR42u2Z63qjOAyGC4RwCOfB2JAGqrSb2WnTw/1f3UaWcSGYNKTdf/P+mOkTrE+yJBulvfvLT2A5ruenaVHyIks33npl/6C4s/ZLAM45SOi/1FtZPyFur1OYofBX3w7d54Bxm+E8db+nDr12ttmESZ4zludJEG5S7TO72YPlKZFyE+YCYUJTBZsMiNS5Sd7NlDmKM2Eg2JQg8awbglfqgbhArjxkS7dgp2RH6hc9AMLdZYUtZN5DJr4molC8BfKrEkPKEnEVjLbgW1fLy77ZVOJagoIcLIl+IxaQZGjiX597HopF5CkaXVMDO9Pyix3AFV3kw4lQLCbHuMovz8FallbcQIJ5Ta0vks9RnolbCK84BtjKRS5uA43hYoZcOBGIG2Epbv6CvFVQ8m8loh66WNySsnN7htL58LNp+NXT8/PhXiBXPMjLSxtwp8W9f/1AngRierBkA+kk/IpUSOeKByzn8y3kAAAfh//0oXgV4roHm/kz4E2z//zRc3/lgwBzbM2mJxQEa5pqgX7d1L0htrhx7LKxOZlKbwcAWyEOWqYSI8YPtgDQVjpB5nvaHaSnBaQSD6hweDi8PosxD6/PT09YY3xQA7LTCTKfYX+QHpA0GCcqmEHvr/cyfKQTEuwgbs2kPxJEB0iNjfJcCTPyocx+A0griHSmADiC91oNGVwJ69RudYe65vJmoqfpul0lrqXadW0jFKH5BKwAeCq+Den7s+3zfRJzA61/Uj/9H/VzLKTx9jFPPdXeeP+L7WEvDLAKAIoF8bPTKT0+TM7W8ePj3Rz/Yn3kOAp2f1Kf0Weony7pn/cPydvhQYV+eFOfmOu7VB/ViPe34/EN3RFHY/yRuT8ddCtMPH/McBAT5s+vRde/gf2c/sPsjLK+m5IBQF5tO+h2tTlBGnP6693JdsvofjOPnnEHkh2TnV/X1fBl9S5zrwuwF8NFrAVJVwCAPTe8gaJlomqlp0pv4Pjn98tJ/t/fL++6unpR1YGC2n/KCoa0tTLoKiEeUPDl94nj+5/Tv3/eT5vBQ60X1S0oZr+IWRR8Ldhu7AlLjPISlJcO9vrFotky9SpzDequlwEir5beYAc0R7D9KS1DXva0jhYRDXoExPdc6yw5GShkZXe9QdO/uOvHofxjrV/TNS6iMJS+4TcSTgk9n5agJdBQbB//IfF/HpvPt3Tbi7b6I6K0R72p6ajryEJrENW2bbeVUGjfgoals4L443c7BEE4mJO2SpbRngxQrAKRudRzGQ8jVOL2qDVjjI8K1gc3TIJ5KiFZ1q+gdsARPB4NQS4AjwVSt72DSoXNyOWUrU5mQ9nRYyjp89Xo7oRI6Bga9QNT1mQ/ptaJq5T/7WcgAZywR/XlPGAUDdet3LE+qS0TI+g+aJU8MIqjo0Kx8Ly+maxLjJmjQ18rA0YCkxLQbUZP1WqdmyQGJLUm7VnQFqodmXSqmRrdVpqdzk5LvmvgtEcW8PMGdaS23EOWyDVbACZzUJPaqMbjDxpA3Qrgl0AikimGDbqmyT8P8NOYiqrldF8rX+YN7TopX4UoHuSCYY7cgX4gHwclQKl1zhx0THf+tCAUValzj
I7Wg9EhptrkIcfIJjA94evOn8B2eHaVzvBrnl2ig0So6hvPaz0IGcOvTHvUIlE2+prqAxLSQxZlU2stql1NqCCLdIiIN/i1DBEHUoElM9dBravbiAnKqgpi4IBkw+utSPIoBijDXJipSVV7MpOEJUAc5Qmm3BnUN+w3hteEieYKfRZSIUcXKMVf0u5wD4EwsUNVvZOtUT7A2GkffHjByWpHqvRBYrTV72a6j8zZ6W0DTE86Hn04bmyWX3Ri9WH7ZU6Q7h+ZHo0nHUAcsQvVhXRDZHChwiyi/hnPuOsSEF6Exk3o6Y9DT1eZ+6cASXk2Y9k+6EOQMDGm6WBK10wOQJCBwren86cPPWUcRAnTVjGcU1LBgs9FURiX/e6479yZcLwCBmTxiawEwrOcleuu12t3tbLv/N4RLYIBhYexm7Fcn4OJcn0+zc+s8/VfPeddZHAGN6TT8eGczHdR/Gts1/MzDkThr23zqrVfAMFT33Nx1RJsx1k5zuWILLnG/vsH+Fv5D4NTVcp1Gzo8AAAAAElFTkSuQmCC&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
[![arXiv](https://img.shields.io/badge/arXiv-2409.18839-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2409.18839)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/opendatalab/MinerU)
<div align="center" xmlns="http://www.w3.org/1999/html">
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: A High-Quality PDF Parsing Toolkit</a>🔥🔥🔥
<br>
<br>
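🚀<a href="https://mineru.net/?source=github">MinerU official website →✅ No-install online version ✅ Full-featured desktop client ✅ Online developer API, with no deployment hassle and several product forms available in one click. Try it now!</a>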
</p>
<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="http://mineru.space/s/V85Yl" target="_blank">WeChat</a>
</p>
</div>
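## Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (such as Markdown and JSON), making it easy to extract the content into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared with well-known commercial products at home and abroad, MinerU is still young. If you encounter any problems or the results fall short of expectations, please submit an [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.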
![type:video](https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c)
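## Key Features
- Removes headers, footers, footnotes, page numbers, and similar elements to keep the text semantically coherent
- Outputs text in human reading order, suitable for single-column, multi-column, and complex layouts
- Preserves the structure of the original document, including headings, paragraphs, and lists
- Extracts images, image descriptions, tables, table titles, and footnotes
- Automatically recognizes formulas in the document and converts them to LaTeX format
- Automatically recognizes tables in the document and converts them to HTML format
- Automatically detects scanned PDFs and garbled PDFs and enables OCR
- OCR supports detection and recognition of 84 languages
- Supports multiple output formats, such as multimodal and NLP Markdown, JSON sorted by reading order, and a rich intermediate format
- Supports multiple visualization results, including layout visualization and span visualization, for efficient checking of output quality
- Runs in a pure CPU environment, and also supports GPU (CUDA) / NPU (CANN) / MPS acceleration
- Compatible with Windows, Linux, and Mac platforms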
# Known Issues
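- Reading order is determined by the model from the spatial distribution of readable content, so some areas may be out of order in extremely complex layouts
- Support for vertical text is limited
- Tables of contents and lists are recognized through rules, so a few uncommon list formats may not be recognized
- Code blocks are not yet supported by the layout model
- Comic books, art albums, primary school textbooks, and exercise books cannot yet be parsed well
- Table recognition may produce row or column errors on complex tables
- OCR may produce inaccurate characters in PDFs in less common languages (for example, diacritical marks in Latin script or easily confused characters in Arabic script)
- Some formulas may fail to render in Markdown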