Unverified commit 0b8c6142, authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #2464 from opendatalab/release-1.3.11

Release 1.3.11
parents 50700646 c1b387ab
......@@ -355,7 +355,7 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">>=3.10</td>
<td colspan="3">3.10~3.13</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
......@@ -365,8 +365,7 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
<td>None</td>
</tr>
<tr>
......@@ -397,7 +396,7 @@ Synced with dev branch updates:
#### 1. Install magic-pdf
```bash
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
pip install -U "magic-pdf[full]"
```
......
......@@ -344,7 +344,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">>=3.10</td>
<td colspan="3">3.10~3.13</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
......@@ -354,8 +354,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
<td>None</td>
</tr>
<tr>
......@@ -390,7 +389,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
> Mirror sources in mainland China may lag behind the latest release; please be patient
```bash
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
```
......
......@@ -45,7 +45,7 @@ RUN /bin/bash -c "wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/m
pip3 install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple"
# Download models and update the configuration file
RUN /bin/bash -c "pip3 install modelscope && \
RUN /bin/bash -c "pip3 install modelscope -i https://mirrors.aliyun.com/pypi/simple && \
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py && \
python3 download_models.py && \
sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
......
......@@ -54,7 +54,7 @@ In the final step, enter `yes`, close the terminal, and reopen it.
### 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
```
......
......@@ -54,7 +54,7 @@ bash Anaconda3-2024.06-1-Linux-x86_64.sh
## 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
```
......
......@@ -2,11 +2,12 @@
### 1. Install CUDA and cuDNN
You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/).
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
### 2. Install Anaconda
......@@ -17,7 +18,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
### 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
```
......@@ -63,7 +64,7 @@ If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-
1. **Overwrite the installation of torch and torchvision** supporting CUDA. (Please select the appropriate index-url based on your CUDA version. For more details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)
```
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.
......
# Windows10/11
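## 1. Install CUDA and cuDNN
## 1. Install the CUDA Environment
You need to install a CUDA version that meets torch's requirements. Currently, torch supports 11.8/12.4/12.6.
You need to install a CUDA version that meets torch's requirements. For details, refer to the [official torch website](https://pytorch.org/get-started/locally/).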
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
## 2. Install Anaconda
......@@ -18,7 +19,7 @@ https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Window
## 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
```
......@@ -64,7 +65,7 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
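**1. Overwrite the installation of torch and torchvision with CUDA-enabled builds** (select the appropriate index-url for your CUDA version; for details, refer to the [official torch website](https://pytorch.org/get-started/locally/))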
```bash
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
**2. Modify the value of "device-mode" in the magic-pdf.json configuration file located in your user directory**
......
......@@ -156,7 +156,10 @@ def doc_analyze(
batch_images = [images_with_extra_info]
results = []
for batch_image in batch_images:
processed_images_count = 0
for index, batch_image in enumerate(batch_images):
processed_images_count += len(batch_image)
logger.info(f'Batch {index + 1}/{len(batch_images)}: {processed_images_count} pages/{len(images_with_extra_info)} pages')
result = may_batch_image_analyze(batch_image, ocr, show_log,layout_model, formula_enable, table_enable)
results.extend(result)
......
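The progress logging added to `doc_analyze` keeps a running page count across batches. A minimal, self-contained sketch of that pattern follows. `split_into_batches` and `progress_messages` are hypothetical helpers written for illustration and are not part of magic-pdf, and the batch size of 4 is arbitrary.

```python
def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def progress_messages(batches, total):
    """Build one cumulative progress line per batch."""
    messages = []
    processed = 0  # running count of pages handled so far
    for index, batch in enumerate(batches):
        processed += len(batch)
        messages.append(
            f"Batch {index + 1}/{len(batches)}: {processed} pages/{total} pages"
        )
    return messages


pages = list(range(10))  # stand-in for the list of page images
batches = split_into_batches(pages, 4)
for line in progress_messages(batches, len(pages)):
    print(line)
```

Because the count is incremented before the message is built, the last batch always reports the full page total.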
......@@ -66,9 +66,9 @@ LEFT_RIGHT_REMOVE_PATTERN = re.compile(r'\\left\.?|\\right\.?')
def fix_latex_left_right(s):
"""
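Fix \left and \right commands in LaTeX
Fix \\left and \\right commands in LaTeX
1. Ensure they are followed by a valid delimiter
2. Balance the number of \left and \right commands
2. Balance the number of \\left and \\right commands
"""
# Whitelisted delimiters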
valid_delims_list = [r'(', r')', r'[', r']', r'{', r'}', r'/', r'|',
......@@ -106,7 +106,7 @@ def fix_latex_left_right(s):
def fix_left_right_pairs(latex_formula):
"""
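Detect and fix cases where \left and \right are not in the same group in a LaTeX formula
Detect and fix cases where \\left and \\right are not in the same group in a LaTeX formula
Args:
latex_formula (str): The input LaTeX formula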
......@@ -308,9 +308,9 @@ ENV_FORMAT_PATTERNS = {env: re.compile(r'\\begin\{' + env + r'\}\{([^}]*)\}') fo
def fix_latex_environments(s):
"""
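Check whether \begin and \end of LaTeX environments (such as array) match
1. If a \begin tag is missing, add it at the beginning
2. If an \end tag is missing, add it at the end
Check whether \\begin and \\end of LaTeX environments (such as array) match
1. If a \\begin tag is missing, add it at the beginning
2. If an \\end tag is missing, add it at the end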
"""
for env in ENV_TYPES:
begin_count = len(ENV_BEGIN_PATTERNS[env].findall(s))
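The docstring changes in this file double the backslashes in `\left`, `\right`, `\begin`, and `\end`. The likely motivation, stated here as an assumption rather than something the commit message confirms, is that in a normal (non-raw) Python string `\b` and `\r` are real escape sequences, while `\l` and `\e` are unrecognized escapes that recent Python versions report with a warning. The short sketch below shows the effect on the two valid escapes:

```python
# In a normal string literal, "\b" is a backspace and "\r" is a carriage
# return, so the LaTeX command names silently lose their first letter.
unescaped_begin = "\begin"   # backspace + "egin"
unescaped_right = "\right"   # carriage return + "ight"

# Doubling the backslash, or using a raw string, keeps the text intact.
escaped_begin = "\\begin"
raw_begin = r"\begin"

print(repr(unescaped_begin))
print(repr(unescaped_right))
print(escaped_begin == raw_begin)
```

Marking the whole docstring as a raw string (`r"""..."""`) would be an alternative to doubling each backslash.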
......@@ -334,7 +334,7 @@ def fix_latex_environments(s):
UP_PATTERN = re.compile(r'\\up([a-zA-Z]+)')
COMMANDS_TO_REMOVE_PATTERN = re.compile(
r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph)')
r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph|protect|null)')
REPLACEMENTS_PATTERNS = {
re.compile(r'\\underbar'): r'\\underline',
re.compile(r'\\Bar'): r'\\hat',
......@@ -346,6 +346,9 @@ REPLACEMENTS_PATTERNS = {
re.compile(r'\\textunderscore'): r'\\_',
re.compile(r'\\fint'): r'⨏',
re.compile(r'\\up '): r'\\ ',
re.compile(r'\\vline = '): r'\\models ',
re.compile(r'\\vDash '): r'\\models ',
re.compile(r'\\sq \\sqcup '): r'\\square ',
}
QQUAD_PATTERN = re.compile(r'\\qquad(?!\s)')
......
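The new entries extend two cleanup tables: commands that are dropped outright (`\protect`, `\null`) and patterns that are rewritten (`\vline = `, `\vDash `, and `\sq \sqcup `). The sketch below shows one plausible way such tables can be applied to a formula string. It copies only the entries added in this commit, and the `normalize` function and its ordering (removal first, then replacement) are assumptions for illustration, not the project's actual code path.

```python
import re

# Subset of the removal pattern: only the two commands added in this commit.
COMMANDS_TO_REMOVE = re.compile(r'\\(?:protect|null)')

# Subset of the replacement table: only the three entries added in this commit.
# In a replacement template, r'\\' produces a single literal backslash.
REPLACEMENTS = {
    re.compile(r'\\vline = '): r'\\models ',
    re.compile(r'\\vDash '): r'\\models ',
    re.compile(r'\\sq \\sqcup '): r'\\square ',
}


def normalize(formula):
    """Drop unwanted commands, then rewrite the mapped patterns."""
    formula = COMMANDS_TO_REMOVE.sub('', formula)
    for pattern, replacement in REPLACEMENTS.items():
        formula = pattern.sub(replacement, formula)
    return formula


print(normalize(r'A \vDash B'))
print(normalize(r'\sq \sqcup x'))
```

Note that the replacement patterns include a trailing space, so they only match when the command is followed by one.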
......@@ -76,11 +76,11 @@ In the final step, enter ``yes``, close the terminal, and reopen it.
4. Create an Environment Using Conda
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify Python version 3.10.
Specify Python version 3.10~3.13.
.. code:: sh
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
5. Install Applications
......@@ -155,14 +155,15 @@ to test CUDA acceleration:
Windows 10/11
--------------
1. Install CUDA and cuDNN
1. Install CUDA
~~~~~~~~~~~~~~~~~~~~~~~~~
You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the `official PyTorch website <https://pytorch.org/get-started/locally/>`_.
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
2. Install Anaconda
......@@ -177,7 +178,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
::
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
4. Install Applications
......
......@@ -61,7 +61,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10~3.12</td>
<td colspan="3">3.10~3.13</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
......@@ -71,8 +71,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
<td>None</td>
</tr>
<tr>
......@@ -97,7 +96,7 @@ Create an environment
.. code-block:: shell
conda create -n mineru 'python>=3.10' -y
conda create -n mineru 'python=3.12' -y
conda activate mineru
pip install -U "magic-pdf[full]"
......
......@@ -4,9 +4,7 @@
## Environment Setup
Use the following commands to set up the required environment:
```bash
pip install -U litserve python-multipart filetype
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118
pip install -U magic-pdf[full] litserve python-multipart filetype
```
## Quick Start
......
......@@ -21,6 +21,7 @@ from magic_pdf.libs.config_reader import get_bucket_name, get_s3_config
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.operators.models import InferenceResult
from magic_pdf.operators.pipes import PipeResult
from fastapi import Form
model_config.__use_inside_model__ = True
......@@ -102,6 +103,7 @@ def init_writers(
# Process the uploaded file
file_bytes = file.file.read()
file_extension = os.path.splitext(file.filename)[1]
writer = FileBasedDataWriter(output_path)
image_writer = FileBasedDataWriter(output_image_path)
os.makedirs(output_image_path, exist_ok=True)
......@@ -176,14 +178,14 @@ def encode_image(image_path: str) -> str:
)
async def file_parse(
file: UploadFile = None,
file_path: str = None,
parse_method: str = "auto",
is_json_md_dump: bool = False,
output_dir: str = "output",
return_layout: bool = False,
return_info: bool = False,
return_content_list: bool = False,
return_images: bool = False,
file_path: str = Form(None),
parse_method: str = Form("auto"),
is_json_md_dump: bool = Form(False),
output_dir: str = Form("output"),
return_layout: bool = Form(False),
return_info: bool = Form(False),
return_content_list: bool = Form(False),
return_images: bool = Form(False),
):
"""
Execute the process of converting PDF to JSON and MD, outputting MD and JSON files
......
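Wrapping the scalar parameters of `file_parse` in `Form(...)` tells FastAPI to read them from the request body as form fields instead of from the query string, so a client that uploads a file sends everything in one multipart/form-data body. The sketch below uses only the standard library to show what such a body looks like. `encode_multipart` is a hypothetical helper, `sample.pdf` and its bytes are placeholders, and the field names are taken from the signature above.

```python
import uuid


def encode_multipart(fields, files=None):
    """Encode text fields and optional files as multipart/form-data.

    fields: dict mapping field name -> string value
    files:  dict mapping field name -> (filename, bytes)
    Returns (content_type_header_value, body_bytes).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
        ).encode('utf-8'))
    for name, (filename, data) in (files or {}).items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\n'
            'Content-Type: application/octet-stream\r\n\r\n'
        ).encode('utf-8')
        parts.append(header + data + b'\r\n')
    parts.append(f'--{boundary}--\r\n'.encode('utf-8'))  # closing boundary
    return f'multipart/form-data; boundary={boundary}', b''.join(parts)


content_type, body = encode_multipart(
    {"parse_method": "auto", "return_layout": "true"},
    {"file": ("sample.pdf", b"%PDF-1.4 placeholder bytes")},
)
print(content_type)
```

In practice an HTTP client library builds this body for you; the point is that options such as `parse_method` now travel as form parts next to the file.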
......@@ -7,9 +7,9 @@ numpy>=1.21.6
pydantic>=2.7.2,<2.11
PyMuPDF>=1.24.9,<1.25.0
scikit-learn>=1.0.2
torch>=2.2.2,!=2.5.0,!=2.5.1
torch>=2.2.2,!=2.5.0,!=2.5.1,<3
torchvision
transformers>=4.49.0,!=4.51.0,<5.0.0
pdfminer.six==20250324
pdfminer.six==20250506
tqdm>=4.67.1
# The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.
......@@ -81,7 +81,7 @@ if __name__ == '__main__':
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
],
python_requires=">=3.10,<4", # Python versions supported by the project
python_requires=">=3.10,<3.14", # Python versions supported by the project
entry_points={
"console_scripts": [
"magic-pdf = magic_pdf.tools.cli:cli",
......
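The tightened `python_requires=">=3.10,<3.14"` makes pip refuse to install the package on interpreters outside that range. The following sketch mirrors the same range check with a plain tuple comparison. `is_supported` is a hypothetical helper for illustration; it compares only major and minor numbers and ignores pre-release handling, which real version specifiers treat more carefully.

```python
import sys

MIN_VERSION = (3, 10)  # inclusive lower bound, from ">=3.10"
MAX_VERSION = (3, 14)  # exclusive upper bound, from "<3.14"


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair falls in the supported range."""
    return MIN_VERSION <= tuple(version_info[:2]) < MAX_VERSION


print(is_supported((3, 12, 0)))
print(is_supported((3, 9, 18)))
print(is_supported((3, 14, 0)))
```

This matches the documentation change in the same commit, which lists the supported range as 3.10~3.13.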