Commit 845a3ff0 authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3
parents d0558abb 6083e109
...@@ -18,6 +18,8 @@ Read the content from jsonl, which may be located on the local machine or remote s3. if y ...@@ -18,6 +18,8 @@ Read the content from jsonl, which may be located on the local machine or remote s3. if y
.. code:: python .. code:: python
from magic_pdf.data.io.read_api import *
# read jsonl from local machine # read jsonl from local machine
datasets = read_jsonl("tt.jsonl", None) datasets = read_jsonl("tt.jsonl", None)
...@@ -33,6 +35,8 @@ Read pdf from path or directory. ...@@ -33,6 +35,8 @@ Read pdf from path or directory.
.. code:: python .. code:: python
from magic_pdf.data.io.read_api import *
# read pdf path # read pdf path
datasets = read_local_pdfs("tt.pdf") datasets = read_local_pdfs("tt.pdf")
...@@ -47,10 +51,11 @@ Read images from path or directory ...@@ -47,10 +51,11 @@ Read images from path or directory
.. code:: python .. code:: python
from magic_pdf.data.io.read_api import *
# read from image path # read from image path
datasets = read_local_images("tt.png") datasets = read_local_images("tt.png")
# read files from a directory whose names end with a suffix in the suffixes array # read files from a directory whose names end with a suffix in the suffixes array
datasets = read_local_images("images/", suffixes=["png", "jpg"]) datasets = read_local_images("images/", suffixes=["png", "jpg"])
......
...@@ -9,14 +9,16 @@ appropriate guide based on your system: ...@@ -9,14 +9,16 @@ appropriate guide based on your system:
- :ref:`ubuntu_22_04_lts_section` - :ref:`ubuntu_22_04_lts_section`
- :ref:`windows_10_or_11_section` - :ref:`windows_10_or_11_section`
- Quick Deployment with Docker
- Quick Deployment with Docker > Docker requires a GPU with at least .. admonition:: Important
16GB of VRAM, and all acceleration features are enabled by default. :class: tip
.. note:: Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
Before running this Docker, you can use the following command to .. code-block:: bash
check if your device supports CUDA acceleration on Docker.
docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
...@@ -42,8 +44,9 @@ Ubuntu 22.04 LTS ...@@ -42,8 +44,9 @@ Ubuntu 22.04 LTS
If you see information similar to the following, it means that the If you see information similar to the following, it means that the
NVIDIA drivers are already installed, and you can skip Step 2. NVIDIA drivers are already installed, and you can skip Step 2.
Notice:``CUDA Version`` should be >= 12.1, If the displayed version .. note::
number is less than 12.1, please upgrade the driver.
``CUDA Version`` should be >= 12.1. If the displayed version number is less than 12.1, please upgrade the driver.
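
The version requirement in the note can also be checked from a script. The following is a minimal sketch, assuming the ``nvidia-smi`` banner contains a ``CUDA Version: X.Y`` field; the helper names ``parse_cuda_version`` and ``cuda_version_ok`` and the sample banner text are illustrative and not part of MinerU.

.. code:: python

   import re

   def parse_cuda_version(smi_output):
       """Return (major, minor) parsed from nvidia-smi banner text, or None."""
       match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
       if match is None:
           return None
       return int(match.group(1)), int(match.group(2))

   def cuda_version_ok(smi_output, minimum=(12, 1)):
       """True when the reported CUDA version is at least `minimum`."""
       version = parse_cuda_version(smi_output)
       return version is not None and version >= minimum

   # Hypothetical banner line, used only for illustration.
   sample = "| NVIDIA-SMI 537.34   Driver Version: 537.34   CUDA Version: 12.2 |"
   print(cuda_version_ok(sample))

Comparing integer tuples rather than strings keeps the check correct for versions such as 12.10.
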
.. code:: text .. code:: text
...@@ -105,8 +108,10 @@ Specify Python version 3.10. ...@@ -105,8 +108,10 @@ Specify Python version 3.10.
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
❗ After installation, make sure to check the version of ``magic-pdf`` .. admonition:: Important
using the following command: :class: tip
❗ After installation, make sure to check the version of ``magic-pdf`` using the following command:
.. code:: sh .. code:: sh
...@@ -127,6 +132,9 @@ the script will automatically generate a ``magic-pdf.json`` file in the ...@@ -127,6 +132,9 @@ the script will automatically generate a ``magic-pdf.json`` file in the
user directory and configure the default model path. You can find the user directory and configure the default model path. You can find the
``magic-pdf.json`` file in your user directory. ``magic-pdf.json`` file in your user directory.
.. admonition:: Tip
:class: tip
The user directory for Linux is “/home/username”. The user directory for Linux is “/home/username”.
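
To locate the generated file without hard-coding the user name, the home directory can be resolved at run time. This is a minimal sketch using only the Python standard library; the helper names ``config_path`` and ``load_config`` are illustrative assumptions, not MinerU functions.

.. code:: python

   import json
   from pathlib import Path

   def config_path(home=None):
       """Path of magic-pdf.json inside `home`, or the current user's home."""
       base = Path(home) if home is not None else Path.home()
       return base / "magic-pdf.json"

   def load_config(path):
       """Read the JSON configuration file into a dict."""
       with open(path, encoding="utf-8") as handle:
           return json.load(handle)

   path = config_path()
   print(path, "exists" if path.exists() else "is missing")

``Path.home()`` returns ``/home/username`` on Linux and ``C:/Users/username`` on Windows, so the same sketch applies to both guides.
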
8. First Run 8. First Run
...@@ -137,7 +145,7 @@ Download a sample file from the repository and test it. ...@@ -137,7 +145,7 @@ Download a sample file from the repository and test it.
.. code:: sh .. code:: sh
wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf
magic-pdf -p small_ocr.pdf magic-pdf -p small_ocr.pdf -o ./output
9. Test CUDA Acceleration 9. Test CUDA Acceleration
~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -145,10 +153,6 @@ Download a sample file from the repository and test it. ...@@ -145,10 +153,6 @@ Download a sample file from the repository and test it.
If your graphics card has at least **8GB** of VRAM, follow these steps If your graphics card has at least **8GB** of VRAM, follow these steps
to test CUDA acceleration: to test CUDA acceleration:
❗ Due to the extremely limited nature of 8GB VRAM for running this
application, you need to close all other programs using VRAM to
ensure that 8GB of VRAM is available when running this application.
1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json`` 1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
configuration file located in your home directory. configuration file located in your home directory.
...@@ -162,7 +166,7 @@ to test CUDA acceleration: ...@@ -162,7 +166,7 @@ to test CUDA acceleration:
.. code:: sh .. code:: sh
magic-pdf -p small_ocr.pdf magic-pdf -p small_ocr.pdf -o ./output
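
Step 1 above can also be scripted instead of editing the file by hand. The sketch below is an illustration under stated assumptions: the values ``"cpu"`` and ``"cuda"`` and the helper name ``set_device_mode`` are assumptions, and the demo works on a temporary copy rather than the real ``magic-pdf.json``.

.. code:: python

   import json
   import tempfile
   from pathlib import Path

   def set_device_mode(config_file, mode):
       """Rewrite the "device-mode" key of a JSON config, keeping other keys."""
       config_file = Path(config_file)
       config = json.loads(config_file.read_text(encoding="utf-8"))
       config["device-mode"] = mode
       config_file.write_text(
           json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
       )
       return config

   # Demo on a throwaway file; the initial contents are assumed for illustration.
   with tempfile.TemporaryDirectory() as tmp:
       demo = Path(tmp) / "magic-pdf.json"
       demo.write_text(
           '{"device-mode": "cpu", "models-dir": "/tmp/models"}', encoding="utf-8"
       )
       updated = set_device_mode(demo, "cuda")
       print(updated["device-mode"])

To apply it to the real configuration, pass the path of ``magic-pdf.json`` in your home directory.
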
10. Enable CUDA Acceleration for OCR 10. Enable CUDA Acceleration for OCR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -178,7 +182,9 @@ to test CUDA acceleration: ...@@ -178,7 +182,9 @@ to test CUDA acceleration:
.. code:: sh .. code:: sh
magic-pdf -p small_ocr.pdf magic-pdf -p small_ocr.pdf -o ./output
.. _windows_10_or_11_section: .. _windows_10_or_11_section:
...@@ -218,7 +224,8 @@ Python version must be 3.10. ...@@ -218,7 +224,8 @@ Python version must be 3.10.
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
.. .. admonition:: Important
:class: tip
❗️After installation, verify the version of ``magic-pdf``: ❗️After installation, verify the version of ``magic-pdf``:
...@@ -226,8 +233,7 @@ Python version must be 3.10. ...@@ -226,8 +233,7 @@ Python version must be 3.10.
magic-pdf --version magic-pdf --version
If the version number is less than 0.7.0, please report it in the If the version number is less than 0.7.0, please report it in the issues section.
issues section.
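
The version check can be automated as well. This is a minimal sketch, assuming the command prints a dotted numeric version somewhere in its output (the sample string ``magic-pdf, version 0.9.3`` is an assumed format); ``version_tuple`` and ``is_supported`` are illustrative names.

.. code:: python

   import re

   def version_tuple(text):
       """Extract the first dotted numeric version in `text` as a tuple of ints."""
       match = re.search(r"\d+(?:\.\d+)+", text)
       if match is None:
           raise ValueError("no version number found in: %r" % text)
       return tuple(int(part) for part in match.group(0).split("."))

   def is_supported(version_text, minimum="0.7.0"):
       """True when the reported version is at least `minimum`."""
       return version_tuple(version_text) >= version_tuple(minimum)

   print(is_supported("magic-pdf, version 0.9.3"))

Comparing tuples of integers avoids the pitfall of string comparison, where ``"0.10.0"`` would sort before ``"0.7.0"``.
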
5. Download Models 5. Download Models
~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
...@@ -242,6 +248,9 @@ the script will automatically generate a ``magic-pdf.json`` file in the ...@@ -242,6 +248,9 @@ the script will automatically generate a ``magic-pdf.json`` file in the
user directory and configure the default model path. You can find the user directory and configure the default model path. You can find the
``magic-pdf.json`` file in your user directory. ``magic-pdf.json`` file in your user directory.
.. admonition:: Tip
:class: tip
The user directory for Windows is “C:/Users/username”. The user directory for Windows is “C:/Users/username”.
7. First Run 7. First Run
...@@ -252,7 +261,7 @@ Download a sample file from the repository and test it. ...@@ -252,7 +261,7 @@ Download a sample file from the repository and test it.
.. code:: powershell .. code:: powershell
wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf magic-pdf -p small_ocr.pdf -o ./output
8. Test CUDA Acceleration 8. Test CUDA Acceleration
~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -260,27 +269,23 @@ Download a sample file from the repository and test it. ...@@ -260,27 +269,23 @@ Download a sample file from the repository and test it.
If your graphics card has at least 8GB of VRAM, follow these steps to If your graphics card has at least 8GB of VRAM, follow these steps to
test CUDA-accelerated parsing performance. test CUDA-accelerated parsing performance.
❗ Due to the extremely limited nature of 8GB VRAM for running this 1. **Overwrite the installation of torch and torchvision** supporting CUDA.
application, you need to close all other programs using VRAM to
ensure that 8GB of VRAM is available when running this application.
1. **Overwrite the installation of torch and torchvision** supporting
CUDA.
:: .. code:: sh
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118 pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
.. .. admonition:: Important
:class: tip
❗️Ensure the following versions are specified in the command: ❗️Ensure the following versions are specified in the command:
::
.. code:: sh
torch==2.3.1 torchvision==0.18.1 torch==2.3.1 torchvision==0.18.1
These are the highest versions we support. Installing higher These are the highest versions we support. Installing higher versions without specifying them will cause the program to fail.
versions without specifying them will cause the program to fail.
2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json`` 2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``
configuration file located in your user directory. configuration file located in your user directory.
...@@ -295,7 +300,7 @@ test CUDA-accelerated parsing performance. ...@@ -295,7 +300,7 @@ test CUDA-accelerated parsing performance.
:: ::
magic-pdf -p small_ocr.pdf magic-pdf -p small_ocr.pdf -o ./output
9. Enable CUDA Acceleration for OCR 9. Enable CUDA Acceleration for OCR
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...@@ -311,5 +316,4 @@ test CUDA-accelerated parsing performance. ...@@ -311,5 +316,4 @@ test CUDA-accelerated parsing performance.
:: ::
magic-pdf -p small_ocr.pdf magic-pdf -p small_ocr.pdf -o ./output
Install Install
=============================================================== ===============================================================
If you encounter any installation issues, please first consult the FAQ. If you encounter any installation issues, please first consult the :doc:`../../additional_notes/faq`.
If the parsing results are not as expected, refer to the Known Issues. If the parsing results are not as expected, refer to the :doc:`../../additional_notes/known_issues`.
There are three different ways to experience MinerU
Pre-installation Notice—Hardware and Software Environment Support .. admonition:: Warning
------------------------------------------------------------------ :class: tip
To ensure the stability and reliability of the project, we only optimize **Pre-installation Notice—Hardware and Software Environment Support**
and test for specific hardware and software environments during
development. This ensures that users deploying and running the project To ensure the stability and reliability of the project, we only optimize
on recommended system configurations will get the best performance with and test for specific hardware and software environments during
the fewest compatibility issues. development. This ensures that users deploying and running the project
on recommended system configurations will get the best performance with
By focusing resources on the mainline environment, our team can more the fewest compatibility issues.
efficiently resolve potential bugs and develop new features.
By focusing resources on the mainline environment, our team can more
In non-mainline environments, due to the diversity of hardware and efficiently resolve potential bugs and develop new features.
software configurations, as well as third-party dependency compatibility
issues, we cannot guarantee 100% project availability. Therefore, for In non-mainline environments, due to the diversity of hardware and
users who wish to use this project in non-recommended environments, we software configurations, as well as third-party dependency compatibility
suggest carefully reading the documentation and FAQ first. Most issues issues, we cannot guarantee 100% project availability. Therefore, for
already have corresponding solutions in the FAQ. We also encourage users who wish to use this project in non-recommended environments, we
community feedback to help us gradually expand support. suggest carefully reading the documentation and FAQ first. Most issues
already have corresponding solutions in the FAQ. We also encourage
community feedback to help us gradually expand support.
.. raw:: html .. raw:: html
...@@ -44,8 +46,8 @@ community feedback to help us gradually expand support. ...@@ -44,8 +46,8 @@ community feedback to help us gradually expand support.
</tr> </tr>
<tr> <tr>
<td colspan="3">CPU</td> <td colspan="3">CPU</td>
<td>x86_64</td> <td>x86_64 (ARM Linux is not supported)</td>
<td>x86_64</td> <td>x86_64 (ARM Windows is not supported)</td>
<td>x86_64 / arm64</td> <td>x86_64 / arm64</td>
</tr> </tr>
<tr> <tr>
...@@ -54,7 +56,7 @@ community feedback to help us gradually expand support. ...@@ -54,7 +56,7 @@ community feedback to help us gradually expand support.
</tr> </tr>
<tr> <tr>
<td colspan="3">Python Version</td> <td colspan="3">Python Version</td>
<td colspan="3">3.10</td> <td colspan="3">3.10 (please make sure to create a Python 3.10 virtual environment using conda)</td>
</tr> </tr>
<tr> <tr>
<td colspan="3">Nvidia Driver Version</td> <td colspan="3">Nvidia Driver Version</td>
...@@ -71,19 +73,20 @@ community feedback to help us gradually expand support. ...@@ -71,19 +73,20 @@ community feedback to help us gradually expand support.
<tr> <tr>
<td rowspan="2">GPU Hardware Support List</td> <td rowspan="2">GPU Hardware Support List</td>
<td colspan="2">Minimum Requirement 8G+ VRAM</td> <td colspan="2">Minimum Requirement 8G+ VRAM</td>
<td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br> <td colspan="2">3060ti/3070/4060<br>
8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td> 8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
<td rowspan="2">None</td> <td rowspan="2">None</td>
</tr> </tr>
<tr> <tr>
<td colspan="2">Recommended Configuration 16G+ VRAM</td> <td colspan="2">Recommended Configuration 10G+ VRAM</td>
<td colspan="2">3090/3090ti/4070ti super/4080/4090<br> <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070ti super/4080/4090<br>
16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously 10G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
</td> </td>
</tr> </tr>
</table> </table>
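
The VRAM tiers in the table can be read as a simple threshold rule. The sketch below encodes the updated (right-hand) values, 8 GB and 10 GB; the function name ``accelerated_features`` and the feature labels are illustrative assumptions and do not correspond to any MinerU API.

.. code:: python

   def accelerated_features(vram_gb):
       """List the features the table says can be accelerated at this VRAM size."""
       features = []
       if vram_gb >= 8:
           # Minimum requirement tier: layout, formula recognition and OCR.
           features += ["layout", "formula recognition", "OCR"]
       if vram_gb >= 10:
           # Recommended tier: table recognition can be enabled at the same time.
           features.append("table recognition")
       return features

   for size in (6, 8, 12):
       print(size, accelerated_features(size))
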
Create an environment Create an environment
~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~
......
...@@ -55,5 +55,8 @@ directory. The output file list is as follows: ...@@ -55,5 +55,8 @@ directory. The output file list is as follows:
├── some_pdf_spans.pdf # smallest granularity bbox position information diagram ├── some_pdf_spans.pdf # smallest granularity bbox position information diagram
└── some_pdf_content_list.json # Rich text JSON arranged in reading order └── some_pdf_content_list.json # Rich text JSON arranged in reading order
For more information about the output files, please refer to the :doc:`../tutorial/output_file_description` .. admonition:: Tip
:class: tip
For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
Extract Content from PDF
========================
.. code:: python
from magic_pdf.data.read_api import read_local_pdfs
from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.1" width="224" height="72" viewBox="-29 -3.67 224 72" xml:space="preserve">
<desc>Created with Fabric.js 5.2.4</desc>
<defs>
</defs>
<rect x="0" y="0" width="100%" height="100%" fill="transparent"></rect>
<g transform="matrix(1 0 0 1 112 36)" id="7a867f58-a908-4f30-a839-fb725512b521" >
<rect style="stroke: none; stroke-width: 1; stroke-dasharray: none; stroke-linecap: butt; stroke-dashoffset: 0; stroke-linejoin: miter; stroke-miterlimit: 4; fill: rgb(255,255,255); fill-rule: nonzero; opacity: 1; visibility: hidden;" vector-effect="non-scaling-stroke" x="-112" y="-36" rx="0" ry="0" width="224" height="72" />
</g>
<g transform="matrix(Infinity NaN NaN Infinity 0 0)" id="29611287-bf1c-4faf-8eb1-df32f6424829" >
</g>
<g transform="matrix(0.07 0 0 0.07 382.02 122.8)" id="60cdd44f-027a-437a-92c4-c8d44c60ef9e" >
<path style="stroke: rgb(0,0,0); stroke-width: 0; stroke-dasharray: none; stroke-linecap: butt; stroke-dashoffset: 0; stroke-linejoin: miter; stroke-miterlimit: 4; fill: rgb(50,50,42); fill-rule: nonzero; opacity: 1;" vector-effect="non-scaling-stroke" transform=" translate(-64, -64)" d="M 57.62 61.68 C 55.919999999999995 61.92 54.75 63.46 55 65.11 C 55.1668510745875 66.32380621250819 56.039448735907676 67.32218371690155 57.22 67.65 C 57.22 67.65 64.69 70.11 77.4 71.16000000000001 C 87.61000000000001 72.01 99.2 70.43 99.2 70.43 C 100.9 70.39 102.23 68.98 102.19 67.28 C 102.17037752125772 66.4652516996782 101.82707564255573 65.69186585376654 101.23597809465886 65.13079230830253 C 100.644880546762 64.56971876283853 99.85466451370849 64.26716220997277 99.03999999999999 64.29 C 98.83999999999999 64.29 98.63999999999999 64.33000000000001 98.42999999999999 64.37 C 98.42999999999999 64.37 87.08999999999999 65.78 77.88 65.02000000000001 C 65.72999999999999 64.05000000000001 59.11 61.83000000000001 59.11 61.83000000000001 C 58.63 61.670000000000016 58.1 61.59000000000001 57.62 61.670000000000016 Z M 57.62 46.46 C 55.919999999999995 46.7 54.75 48.24 55 49.89 C 55.1668510745875 51.10380621250818 56.039448735907676 52.10218371690154 57.22 52.43 C 57.22 52.43 64.69 54.89 77.4 55.94 C 87.61000000000001 56.79 99.2 55.21 99.2 55.21 C 100.9 55.17 102.23 53.76 102.19 52.06 C 102.17037752125772 51.245251699678214 101.82707564255573 50.47186585376654 101.23597809465886 49.91079230830253 C 100.644880546762 49.34971876283853 99.85466451370849 49.047162209972754 99.03999999999999 49.07 C 98.83999999999999 49.07 98.63999999999999 49.11 98.42999999999999 49.15 C 98.42999999999999 49.15 87.08999999999999 50.559999999999995 77.88 49.8 C 65.72999999999999 48.83 59.11 46.61 59.11 46.61 C 58.63 46.45 58.1 46.37 57.62 46.45 Z M 57.62 31.240000000000002 C 55.919999999999995 31.48 54.75 33.02 55 34.67 C 55.1668510745875 35.88380621250818 56.039448735907676 36.882183716901544 57.22 37.21 C 57.22 
37.21 64.69 39.67 77.4 40.72 C 87.61000000000001 41.57 99.2 39.99 99.2 39.99 C 100.9 39.95 102.23 38.54 102.19 36.84 C 102.17037752125772 36.025251699678215 101.82707564255573 35.25186585376654 101.23597809465886 34.690792308302534 C 100.644880546762 34.12971876283853 99.85466451370849 33.827162209972755 99.03999999999999 33.85 C 98.83999999999999 33.85 98.63999999999999 33.89 98.42999999999999 33.93 C 98.42999999999999 33.93 87.08999999999999 35.339999999999996 77.88 34.58 C 65.72999999999999 33.61 59.11 31.389999999999997 59.11 31.389999999999997 C 58.63 31.229999999999997 58.1 31.189999999999998 57.62 31.229999999999997 Z M 57.62 16.060000000000002 C 55.919999999999995 16.3 54.75 17.840000000000003 55 19.490000000000002 C 55.1668510745875 20.703806212508187 56.039448735907676 21.702183716901544 57.22 22.03 C 57.22 22.03 64.69 24.490000000000002 77.4 25.54 C 87.61000000000001 26.39 99.2 24.81 99.2 24.81 C 100.9 24.77 102.23 23.36 102.19 21.66 C 102.17037752125772 20.84525169967821 101.82707564255573 20.07186585376654 101.23597809465886 19.510792308302534 C 100.644880546762 18.949718762838526 99.8546645137085 18.64716220997276 99.03999999999999 18.67 C 98.83999999999999 18.67 98.63999999999999 18.71 98.42999999999999 18.75 C 98.42999999999999 18.75 87.08999999999999 20.16 77.88 19.4 C 65.72999999999999 18.43 59.11 16.209999999999997 59.11 16.209999999999997 C 58.637850878541954 16.01924514007714 58.12188500879498 15.963839409097599 57.62 16.049999999999997 Z M 36.31 0 C 20.32 0.12 14.39 5.05 14.39 5.05 L 14.39 124.42 C 14.39 124.42 20.2 119.41 38.93 120.18 C 57.66 120.95000000000002 61.5 127.53 84.50999999999999 127.97000000000001 C 107.52 128.41000000000003 113.28999999999999 124.42000000000002 113.28999999999999 124.42000000000002 L 113.60999999999999 2.750000000000014 C 113.60999999999999 2.750000000000014 103.28 5.7 83.09 5.86 C 62.95 6.01 58.11 0.73 39.62 0.12 C 38.49 0.04 37.4 0 36.31 0 Z M 49.67 7.79 C 49.67 7.79 59.36 10.98 77.24000000000001 
11.870000000000001 C 92.38000000000001 12.64 107.52000000000001 10.38 107.52000000000001 10.38 L 107.52000000000001 118.53 C 107.52000000000001 118.53 99.85000000000001 122.57000000000001 80.68 121.19 C 65.82000000000001 120.14 49.480000000000004 114.49 49.480000000000004 114.49 L 49.68000000000001 7.799999999999997 Z M 40.35 10.620000000000001 C 42.050000000000004 10.620000000000001 43.46 11.990000000000002 43.46 13.73 C 43.46 15.469999999999999 42.09 16.84 40.35 16.84 C 40.35 16.84 35.34 16.88 32.28 17.16 C 27.150000000000002 17.68 23.64 19.54 23.64 19.54 C 22.150000000000002 20.349999999999998 20.25 19.74 19.48 18.25 C 18.67 16.76 19.28 14.86 20.77 14.09 C 22.259999999999998 13.32 25.33 11.67 31.67 11.06 C 35.34 10.66 40.35 10.620000000000001 40.35 10.620000000000001 Z M 37.36 25.880000000000003 C 39.06 25.840000000000003 40.35 25.880000000000003 40.35 25.880000000000003 C 42.050000000000004 26.080000000000002 43.260000000000005 27.62 43.050000000000004 29.310000000000002 C 42.88374644848126 30.726609090871516 41.76660909087151 31.843746448481262 40.35 32.010000000000005 C 40.35 32.010000000000005 35.34 32.050000000000004 32.28 32.330000000000005 C 27.150000000000002 32.85000000000001 23.64 34.71000000000001 23.64 34.71000000000001 C 22.150000000000002 35.52000000000001 20.25 34.91000000000001 19.48 33.42000000000001 C 18.67 31.93000000000001 19.28 30.03000000000001 20.77 29.26000000000001 C 20.77 29.26000000000001 25.33 26.84000000000001 31.67 26.230000000000008 C 33.53 25.99000000000001 35.67 25.910000000000007 37.36 25.870000000000008 Z M 40.35 41.06 C 42.050000000000004 41.06 43.46 42.43 43.46 44.17 C 43.46 45.910000000000004 42.09 47.28 40.35 47.28 C 40.35 47.28 35.34 47.24 32.28 47.56 C 27.150000000000002 48.080000000000005 23.64 49.940000000000005 23.64 49.940000000000005 C 22.150000000000002 50.75000000000001 20.25 50.14000000000001 19.48 48.650000000000006 C 18.67 47.160000000000004 19.28 45.260000000000005 20.77 44.49000000000001 C 20.77 
44.49000000000001 25.33 42.07000000000001 31.67 41.46000000000001 C 35.34 41.02000000000001 40.35 41.06000000000001 40.35 41.06000000000001 Z" stroke-linecap="round" />
</g>
<g transform="matrix(0.07 0 0 0.07 396.05 123.14)" style="" id="eb0df536-c517-4781-a7c0-3f84cd77c272" >
<text xml:space="preserve" font-family="Lato" font-size="40" font-style="normal" font-weight="400" style="stroke: none; stroke-width: 1; stroke-dasharray: none; stroke-linecap: butt; stroke-dashoffset: 0; stroke-linejoin: miter; stroke-miterlimit: 4; fill: rgb(0,0,0); fill-rule: nonzero; opacity: 1; white-space: pre;" ><tspan x="-130" y="12.57" >Read The Docs</tspan></text>
</g>
<g transform="matrix(0.28 0 0 0.28 27.88 36)" id="7b9eddb9-1652-4040-9437-2ab90652d624" >
<path style="stroke: rgb(0,0,0); stroke-width: 0; stroke-dasharray: none; stroke-linecap: butt; stroke-dashoffset: 0; stroke-linejoin: miter; stroke-miterlimit: 4; fill: rgb(50,50,42); fill-rule: nonzero; opacity: 1;" vector-effect="non-scaling-stroke" transform=" translate(-64, -64)" d="M 57.62 61.68 C 55.919999999999995 61.92 54.75 63.46 55 65.11 C 55.1668510745875 66.32380621250819 56.039448735907676 67.32218371690155 57.22 67.65 C 57.22 67.65 64.69 70.11 77.4 71.16000000000001 C 87.61000000000001 72.01 99.2 70.43 99.2 70.43 C 100.9 70.39 102.23 68.98 102.19 67.28 C 102.17037752125772 66.4652516996782 101.82707564255573 65.69186585376654 101.23597809465886 65.13079230830253 C 100.644880546762 64.56971876283853 99.85466451370849 64.26716220997277 99.03999999999999 64.29 C 98.83999999999999 64.29 98.63999999999999 64.33000000000001 98.42999999999999 64.37 C 98.42999999999999 64.37 87.08999999999999 65.78 77.88 65.02000000000001 C 65.72999999999999 64.05000000000001 59.11 61.83000000000001 59.11 61.83000000000001 C 58.63 61.670000000000016 58.1 61.59000000000001 57.62 61.670000000000016 Z M 57.62 46.46 C 55.919999999999995 46.7 54.75 48.24 55 49.89 C 55.1668510745875 51.10380621250818 56.039448735907676 52.10218371690154 57.22 52.43 C 57.22 52.43 64.69 54.89 77.4 55.94 C 87.61000000000001 56.79 99.2 55.21 99.2 55.21 C 100.9 55.17 102.23 53.76 102.19 52.06 C 102.17037752125772 51.245251699678214 101.82707564255573 50.47186585376654 101.23597809465886 49.91079230830253 C 100.644880546762 49.34971876283853 99.85466451370849 49.047162209972754 99.03999999999999 49.07 C 98.83999999999999 49.07 98.63999999999999 49.11 98.42999999999999 49.15 C 98.42999999999999 49.15 87.08999999999999 50.559999999999995 77.88 49.8 C 65.72999999999999 48.83 59.11 46.61 59.11 46.61 C 58.63 46.45 58.1 46.37 57.62 46.45 Z M 57.62 31.240000000000002 C 55.919999999999995 31.48 54.75 33.02 55 34.67 C 55.1668510745875 35.88380621250818 56.039448735907676 36.882183716901544 57.22 37.21 C 57.22 
37.21 64.69 39.67 77.4 40.72 C 87.61000000000001 41.57 99.2 39.99 99.2 39.99 C 100.9 39.95 102.23 38.54 102.19 36.84 C 102.17037752125772 36.025251699678215 101.82707564255573 35.25186585376654 101.23597809465886 34.690792308302534 C 100.644880546762 34.12971876283853 99.85466451370849 33.827162209972755 99.03999999999999 33.85 C 98.83999999999999 33.85 98.63999999999999 33.89 98.42999999999999 33.93 C 98.42999999999999 33.93 87.08999999999999 35.339999999999996 77.88 34.58 C 65.72999999999999 33.61 59.11 31.389999999999997 59.11 31.389999999999997 C 58.63 31.229999999999997 58.1 31.189999999999998 57.62 31.229999999999997 Z M 57.62 16.060000000000002 C 55.919999999999995 16.3 54.75 17.840000000000003 55 19.490000000000002 C 55.1668510745875 20.703806212508187 56.039448735907676 21.702183716901544 57.22 22.03 C 57.22 22.03 64.69 24.490000000000002 77.4 25.54 C 87.61000000000001 26.39 99.2 24.81 99.2 24.81 C 100.9 24.77 102.23 23.36 102.19 21.66 C 102.17037752125772 20.84525169967821 101.82707564255573 20.07186585376654 101.23597809465886 19.510792308302534 C 100.644880546762 18.949718762838526 99.8546645137085 18.64716220997276 99.03999999999999 18.67 C 98.83999999999999 18.67 98.63999999999999 18.71 98.42999999999999 18.75 C 98.42999999999999 18.75 87.08999999999999 20.16 77.88 19.4 C 65.72999999999999 18.43 59.11 16.209999999999997 59.11 16.209999999999997 C 58.637850878541954 16.01924514007714 58.12188500879498 15.963839409097599 57.62 16.049999999999997 Z M 36.31 0 C 20.32 0.12 14.39 5.05 14.39 5.05 L 14.39 124.42 C 14.39 124.42 20.2 119.41 38.93 120.18 C 57.66 120.95000000000002 61.5 127.53 84.50999999999999 127.97000000000001 C 107.52 128.41000000000003 113.28999999999999 124.42000000000002 113.28999999999999 124.42000000000002 L 113.60999999999999 2.750000000000014 C 113.60999999999999 2.750000000000014 103.28 5.7 83.09 5.86 C 62.95 6.01 58.11 0.73 39.62 0.12 C 38.49 0.04 37.4 0 36.31 0 Z M 49.67 7.79 C 49.67 7.79 59.36 10.98 77.24000000000001 
11.870000000000001 C 92.38000000000001 12.64 107.52000000000001 10.38 107.52000000000001 10.38 L 107.52000000000001 118.53 C 107.52000000000001 118.53 99.85000000000001 122.57000000000001 80.68 121.19 C 65.82000000000001 120.14 49.480000000000004 114.49 49.480000000000004 114.49 L 49.68000000000001 7.799999999999997 Z M 40.35 10.620000000000001 C 42.050000000000004 10.620000000000001 43.46 11.990000000000002 43.46 13.73 C 43.46 15.469999999999999 42.09 16.84 40.35 16.84 C 40.35 16.84 35.34 16.88 32.28 17.16 C 27.150000000000002 17.68 23.64 19.54 23.64 19.54 C 22.150000000000002 20.349999999999998 20.25 19.74 19.48 18.25 C 18.67 16.76 19.28 14.86 20.77 14.09 C 22.259999999999998 13.32 25.33 11.67 31.67 11.06 C 35.34 10.66 40.35 10.620000000000001 40.35 10.620000000000001 Z M 37.36 25.880000000000003 C 39.06 25.840000000000003 40.35 25.880000000000003 40.35 25.880000000000003 C 42.050000000000004 26.080000000000002 43.260000000000005 27.62 43.050000000000004 29.310000000000002 C 42.88374644848126 30.726609090871516 41.76660909087151 31.843746448481262 40.35 32.010000000000005 C 40.35 32.010000000000005 35.34 32.050000000000004 32.28 32.330000000000005 C 27.150000000000002 32.85000000000001 23.64 34.71000000000001 23.64 34.71000000000001 C 22.150000000000002 35.52000000000001 20.25 34.91000000000001 19.48 33.42000000000001 C 18.67 31.93000000000001 19.28 30.03000000000001 20.77 29.26000000000001 C 20.77 29.26000000000001 25.33 26.84000000000001 31.67 26.230000000000008 C 33.53 25.99000000000001 35.67 25.910000000000007 37.36 25.870000000000008 Z M 40.35 41.06 C 42.050000000000004 41.06 43.46 42.43 43.46 44.17 C 43.46 45.910000000000004 42.09 47.28 40.35 47.28 C 40.35 47.28 35.34 47.24 32.28 47.56 C 27.150000000000002 48.080000000000005 23.64 49.940000000000005 23.64 49.940000000000005 C 22.150000000000002 50.75000000000001 20.25 50.14000000000001 19.48 48.650000000000006 C 18.67 47.160000000000004 19.28 45.260000000000005 20.77 44.49000000000001 C 20.77 
44.49000000000001 25.33 42.07000000000001 31.67 41.46000000000001 C 35.34 41.02000000000001 40.35 41.06000000000001 40.35 41.06000000000001 Z" stroke-linecap="round" />
</g>
<g transform="matrix(0.9 0 0 0.9 94 36)" style="" id="385bde16-f9fa-4222-bfea-1d5d5efcf730" >
<text xml:space="preserve" font-family="Lato" font-size="15" font-style="normal" font-weight="100" style="stroke: none; stroke-width: 1; stroke-dasharray: none; stroke-linecap: butt; stroke-dashoffset: 0; stroke-linejoin: miter; stroke-miterlimit: 4; fill: rgb(0,0,0); fill-rule: nonzero; opacity: 1; white-space: pre;" ><tspan x="-48.68" y="4.71" >Read The Docs</tspan></text>
</g>
</svg>
\ No newline at end of file
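Frequently Asked Questions
==========================
1. On newer macOS versions, ``pip install magic-pdf[full]`` fails with ``zsh: no matches found: magic-pdf[full]``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On macOS, the default shell has switched from Bash to Z shell. Z shell treats certain strings, such as the square brackets in ``magic-pdf[full]``, as patterns to match, which can lead to the ``no matches found`` error. Disable this globbing behavior in the shell, then run the install command again: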
.. code:: bash
setopt no_nomatch
pip install magic-pdf[full]
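2. ``_pickle.UnpicklingError: invalid load key, 'v'.`` is raised during use
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This may be caused by an incomplete download of the model files. Try downloading the model files again and retry. Reference: https://github.com/opendatalab/MinerU/issues/143
3. Where should the model files be downloaded, and how should ``models-dir`` be configured?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The model file path is configured in ``magic-pdf.json`` via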
.. code:: json
{
"models-dir": "/tmp/models"
}
This path must be an absolute path, not a relative one. You can obtain the absolute path by running ``pwd`` inside the models directory.
Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
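
As an alternative to ``pwd``, the absolute path can be computed in Python. This is a minimal sketch using the standard library; the helper name ``absolute_models_dir`` and the relative path ``models`` are illustrative assumptions.

.. code:: python

   from pathlib import Path

   def absolute_models_dir(path):
       """Expand ``~`` and resolve a possibly relative path to an absolute one."""
       return str(Path(path).expanduser().resolve())

   # A relative path is resolved against the current working directory.
   print(absolute_models_dir("models"))

The printed value is what would go into the ``"models-dir"`` entry of ``magic-pdf.json``.
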
4. ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` on Ubuntu 22.04 under WSL2
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ubuntu 22.04 under WSL2 lacks the ``libgl`` library. Install it with the following command:
.. code:: bash
sudo apt-get install libgl1-mesa-glx
Reference: https://github.com/opendatalab/MinerU/issues/388
5. ``ModuleNotFoundError: No module named 'fairscale'`` is raised
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Uninstall the module and install it again:
.. code:: bash
pip uninstall fairscale
pip install fairscale
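Reference: https://github.com/opendatalab/MinerU/issues/411
6. On some newer devices such as the H100, text parsed with CUDA-accelerated OCR is garbled
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CUDA 11 has poor compatibility with newer GPUs, so the CUDA version used by Paddle needs to be upgraded: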
.. code:: bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
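Reference: https://github.com/opendatalab/MinerU/issues/558
7. On some Linux servers, the program fails as soon as it starts with ``非法指令 (核心已转储)`` or ``Illegal instruction (core dumped)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The server's CPU may not support the AVX/AVX2 instruction sets, or the CPU supports them but they have been disabled by the operations team. Try asking the operations team to lift the restriction, or switch to a different server.
Reference: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736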
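
Whether the CPU exposes AVX/AVX2 can be checked on Linux by reading the ``flags`` lines of ``/proc/cpuinfo``. The following is a minimal sketch under that assumption; ``cpu_flags`` and ``has_avx`` are illustrative helper names, and the sample text is made up for the demonstration.

.. code:: python

   def cpu_flags(cpuinfo_text):
       """Collect CPU feature flags from /proc/cpuinfo-style text."""
       flags = set()
       for line in cpuinfo_text.splitlines():
           key, _, value = line.partition(":")
           if key.strip() == "flags":
               flags.update(value.split())
       return flags

   def has_avx(cpuinfo_text):
       """Return (avx_supported, avx2_supported) as a pair of booleans."""
       flags = cpu_flags(cpuinfo_text)
       return "avx" in flags, "avx2" in flags

   # Made-up sample; on a real server, read the text from /proc/cpuinfo instead.
   sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2\n"
   print(has_avx(sample))

If either value is ``False`` on the target server, the illegal-instruction error described above is the likely outcome.
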
Glossary
========
1. jsonl
TODO: add description
2. magic-pdf.json
TODO: add description
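Known Issues
============
- Reading order is determined by a model that sorts readable content by its spatial distribution; under extremely complex layouts, some regions may come out in the wrong order
- Vertical text is not supported
- Tables of contents and lists are recognized by rules, so a few uncommon list forms may not be recognized
- Only a single heading level is produced; heading hierarchy is not currently supported
- Code blocks are not yet supported by the layout model
- Comic books, art albums, primary school textbooks, and exercise books cannot yet be parsed well
- Table recognition may produce row or column errors on complex tables
- On PDFs in less common languages, OCR may produce inaccurate characters (for example, diacritics in Latin script or easily confused characters in Arabic)
- Some formulas may fail to render in Markdown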