Commit bcbbee8c authored by Xiaomeng Zhao, committed by GitHub

Merge pull request #2622 from myhloli/dev

Dev
parents 3cc3f754 ced5a7b4
FAQ
==========================
1. When using the command ``pip install magic-pdf[full]`` on newer versions of macOS, the error ``zsh: no matches found: magic-pdf[full]`` occurs.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
On macOS, the default shell has switched from Bash to Z shell (zsh).
zsh treats the square brackets in ``magic-pdf[full]`` as a glob pattern
and reports “no matches found” when no file name matches it. You can
disable this behavior for the current session and then run the
installation command again. Quoting the argument, as in
``pip install 'magic-pdf[full]'``, also avoids the error.

.. code:: bash

   setopt no_nomatch
   pip install magic-pdf[full]

2. Encountering the error ``pickle.UnpicklingError: invalid load key, 'v'.`` during use
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This might be caused by an incomplete download of the model file.
Re-download the model file and try again.

Reference: https://github.com/opendatalab/MinerU/issues/143
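
One frequent cause of this exact message, offered here as an assumption rather than something the linked issue confirms, is a repository cloned without Git LFS: each weight file is then a small text pointer that begins with ``version https://git-lfs...``, and the unpickler fails on the leading ``v``. The sketch below flags such files before you re-download. The helper names ``looks_like_lfs_pointer`` and ``find_suspect_files`` are hypothetical and are not part of MinerU.

```python
from pathlib import Path

# Git LFS pointer files are small text files that start with this prefix.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/"


def looks_like_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer prefix."""
    with open(path, "rb") as f:
        return f.read(len(LFS_PREFIX)) == LFS_PREFIX


def find_suspect_files(models_dir):
    """List files under models_dir that are LFS pointers, not real weights."""
    return sorted(
        str(p)
        for p in Path(models_dir).rglob("*")
        if p.is_file() and looks_like_lfs_pointer(p)
    )
```

An empty result does not prove the download is complete. It only rules out this one failure mode, since a truncated binary file would not be detected.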
3. Where should the model files be downloaded and how should the ``/models-dir`` configuration be set?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The path to the model files is configured in ``magic-pdf.json``, for
example:

.. code:: json

   {
     "models-dir": "/tmp/models"
   }

This must be an absolute path, not a relative one. You can obtain the
absolute path by running ``pwd`` inside the models directory.

Reference:
https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
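
The absolute-path requirement can be checked before running the tool. This is a minimal sketch, not MinerU's own validation: the function name ``check_models_dir`` is hypothetical, and the location of ``magic-pdf.json`` is passed in by the caller because the text above does not specify it.

```python
import json
import os


def check_models_dir(config_path):
    """Return the configured models-dir, or raise if it is unusable."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    models_dir = config["models-dir"]
    if not os.path.isabs(models_dir):
        raise ValueError(f"models-dir must be an absolute path, got {models_dir!r}")
    if not os.path.isdir(models_dir):
        raise FileNotFoundError(f"models-dir does not exist: {models_dir}")
    return models_dir
```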
4. Encountered the error ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` in Ubuntu 22.04 on WSL2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``libgl`` library is missing in Ubuntu 22.04 on WSL2. You can
install the ``libgl`` library with the following command to resolve the
issue:

.. code:: bash

   sudo apt-get install libgl1-mesa-glx

Reference: https://github.com/opendatalab/MinerU/issues/388
5. Encountered error ``ModuleNotFoundError: No module named 'fairscale'``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You need to uninstall the module and reinstall it:

.. code:: bash

   pip uninstall fairscale
   pip install fairscale

Reference: https://github.com/opendatalab/MinerU/issues/411
6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
CUDA 11 has poor compatibility with newer graphics cards, so the CUDA
version used by Paddle needs to be upgraded:

.. code:: bash

   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/

Reference: https://github.com/opendatalab/MinerU/issues/558
7. On some Linux servers, the program immediately reports an error ``Illegal instruction (core dumped)``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This might be because the server's CPU does not support the AVX/AVX2
instruction set, or because the CPU supports it but the system
administrator has disabled it. You can try contacting the system
administrator to remove the restriction, or switch to a different server.

References: https://github.com/opendatalab/MinerU/issues/591 ,
https://github.com/opendatalab/MinerU/issues/736
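
To confirm the diagnosis on a Linux host, you can inspect the CPU feature flags that the kernel exposes in ``/proc/cpuinfo``. The sketch below assumes an x86 machine, where features are listed on ``flags`` lines. The function names are illustrative only.

```python
def parse_cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_avx(cpuinfo_text):
    """Return the subset of {'avx', 'avx2'} that the CPU does not report."""
    return {"avx", "avx2"} - parse_cpu_flags(cpuinfo_text)
```

Calling ``missing_avx(open("/proc/cpuinfo").read())`` on the affected server returns an empty set when both instruction sets are visible to the operating system. If AVX is masked by a hypervisor, the flags will normally be absent here as well.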
Glossary
===========
1. jsonl
Newline-delimited (``\n``) JSON: each line must be a valid, independent JSON object.
Currently, all functions shipped with **MinerU** assume that each JSON object contains a field named either **path** or **file_location**.
2. magic-pdf.json
TODO
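
The jsonl constraints described in the first entry above can be checked with a few lines of Python. This is a sketch under stated assumptions, not a MinerU API: the name ``iter_jsonl_records`` is hypothetical, and skipping blank lines is a convenience choice of this example rather than part of the definition.

```python
import json


def iter_jsonl_records(lines):
    """Yield (line_number, record) for each non-empty JSONL line."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: tolerate blank lines such as a trailing newline
        record = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        if "path" not in record and "file_location" not in record:
            raise ValueError(f"line {number}: missing 'path' or 'file_location'")
        yield number, record
```

Because an open file is an iterable of lines, it can be passed in directly, for example ``list(iter_jsonl_records(open("docs.jsonl")))`` (the file name is only an example).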
Known Issues
============
- Reading order is determined by the model based on the spatial
  distribution of readable content, and may be out of order in some
  areas under extremely complex layouts.
- Vertical text is not supported.
- Tables of contents and lists are recognized through rules, and some
  uncommon list formats may not be recognized.
- Only one level of headings is supported; hierarchical headings are
  not currently supported.
- Code blocks are not yet supported in the layout model.
- Comic books, art albums, primary school textbooks, and exercises
  cannot be parsed well.
- Table recognition may result in row/column recognition errors in
  complex tables.
- OCR recognition may produce inaccurate characters in PDFs of
  lesser-known languages (e.g., diacritical marks in Latin script,
  easily confused characters in Arabic script).
- Some formulas may not render correctly in Markdown.
.. toctree::
   :maxdepth: 2

   api/dataset
   api/data_reader_writer
   api/read_api
   api/schemas
   api/io
   api/pipe_operators
   api/model_operators
Data Reader Writer
===================
.. autoclass:: magic_pdf.data.data_reader_writer.DataReader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.DataWriter
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.S3DataReader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.S3DataWriter
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.FileBasedDataReader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.FileBasedDataWriter
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.MultiBucketS3DataReader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.data_reader_writer.MultiBucketS3DataWriter
   :members:
   :inherited-members:
   :show-inheritance:

Dataset
========
.. autoclass:: magic_pdf.data.dataset.PageableData
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.dataset.Dataset
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.dataset.ImageDataset
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.dataset.PymuDocDataset
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.dataset.Doc
   :members:
   :inherited-members:
   :show-inheritance:

IO
==
.. autoclass:: magic_pdf.data.io.base.IOReader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.io.base.IOWriter
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.io.s3.S3Reader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.io.s3.S3Writer
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.io.http.HttpReader
   :members:
   :inherited-members:
   :show-inheritance:

.. autoclass:: magic_pdf.data.io.http.HttpWriter
   :members:
   :inherited-members:
   :show-inheritance:

Model Api
==========
.. autoclass:: magic_pdf.operators.InferenceResultBase
   :members:
   :inherited-members:
   :show-inheritance:

Pipeline Api
=============
.. autoclass:: magic_pdf.operators.pipes.PipeResult
   :members:
   :inherited-members:
   :show-inheritance:

read_api
=========
.. automodule:: magic_pdf.data.read_api
   :members:
   :inherited-members:

schemas
===========
.. autopydantic_model:: magic_pdf.data.schemas.S3Config
   :members:

.. autopydantic_model:: magic_pdf.data.schemas.PageInfo
   :members:

# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import subprocess
import sys
from sphinx.ext import autodoc
from docutils import nodes
from docutils.parsers.rst import Directive
def install(package):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])


requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt'))
if os.path.exists(requirements_path):
    with open(requirements_path) as f:
        packages = f.readlines()
    for package in packages:
        # Skip blank lines and comments, which pip would reject as requirements.
        package = package.strip()
        if package and not package.startswith('#'):
            install(package)

sys.path.insert(0, os.path.abspath('../..'))
# -- Project information -----------------------------------------------------
project = 'MinerU'
copyright = '2024, MinerU Contributors'
author = 'OpenDataLab'
# The full version, including alpha/beta/rc tags
version_file = '../../magic_pdf/libs/version.py'
with open(version_file) as f:
    exec(compile(f.read(), version_file, 'exec'))
__version__ = locals()['__version__']
# The short X.Y version
version = __version__
# The full version, including alpha/beta/rc tags
release = __version__
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.intersphinx',
'sphinx_copybutton',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.inheritance_diagram',
'myst_parser',
'sphinxarg.ext',
'sphinxcontrib.autodoc_pydantic',
]
# class hierarchy diagram
inheritance_graph_attrs = dict(rankdir="LR", size='"8.0, 12.0"', fontsize=14, ratio='compress')
inheritance_node_attrs = dict(shape='ellipse', fontsize=14, height=0.75)
inheritance_edge_attrs = dict(arrow='vee')
autodoc_pydantic_model_show_json = True
autodoc_pydantic_model_show_config_summary = False
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# Exclude the prompt "$" when copying code
copybutton_prompt_text = r'\$ '
copybutton_prompt_is_regexp = True
language = 'en'
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_book_theme'
html_logo = '_static/image/logo.png'
html_theme_options = {
'path_to_docs': 'next_docs/en',
'repository_url': 'https://github.com/opendatalab/MinerU',
'use_repository_button': True,
}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
# html_static_path = ['_static']
# Mock out external dependencies here.
autodoc_mock_imports = [
'cpuinfo',
'torch',
'transformers',
'psutil',
'prometheus_client',
'sentencepiece',
'vllm.cuda_utils',
'vllm._C',
# 'numpy',
'tqdm',
]
class MockedClassDocumenter(autodoc.ClassDocumenter):
    """Remove note about base class when a class is derived from object."""

    def add_line(self, line: str, source: str, *lineno: int) -> None:
        if line == ' Bases: :py:class:`object`':
            return
        super().add_line(line, source, *lineno)


autodoc.ClassDocumenter = MockedClassDocumenter
navigation_with_keys = False
# add custom directive
class VideoDirective(Directive):
    required_arguments = 1
    optional_arguments = 0
    final_argument_whitespace = True
    option_spec = {}

    def run(self):
        url = self.arguments[0]
        video_node = nodes.raw('', f'<iframe width="560" height="315" src="{url}" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>', format='html')
        return [video_node]


def setup(app):
    app.add_directive('video', VideoDirective)
.. xtuner documentation master file, created by
   sphinx-quickstart on Tue Jan 9 16:33:06 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to the MinerU Documentation
==============================================
.. figure:: ./_static/image/logo.png
   :align: center
   :alt: mineru
   :class: no-scaled-link

.. raw:: html

   <p style="text-align:center">
   <strong>A one-stop, open-source, high-quality data extraction tool
   </strong>
   </p>

   <p style="text-align:center">
   <script async defer src="https://buttons.github.io/buttons.js"></script>
   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
   </p>

Project Introduction
--------------------
MinerU is a tool that converts PDFs into machine-readable formats (e.g.,
markdown, JSON), allowing for easy extraction into any format. MinerU
was born during the pre-training process of
`InternLM <https://github.com/InternLM/InternLM>`__. We focus on solving
symbol conversion issues in scientific literature and hope to contribute
to technological development in the era of large models. Compared to
well-known commercial products, MinerU is still young. If you encounter
any issues or if the results are not as expected, please submit an
`issue <https://github.com/opendatalab/MinerU/issues>`__ and **attach
the relevant PDF**.
.. video:: https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
Key Features
------------
- Remove headers, footers, footnotes, page numbers, etc., to ensure
  semantic coherence.
- Output text in human-readable order, suitable for single-column,
  multi-column, and complex layouts.
- Preserve the structure of the original document, including headings,
  paragraphs, lists, etc.
- Extract images, image descriptions, tables, table titles, and
  footnotes.
- Automatically recognize and convert formulas in the document to LaTeX
  format.
- Automatically recognize and convert tables in the document to LaTeX
  or HTML format.
- Automatically detect scanned PDFs and garbled PDFs and enable OCR
  functionality.
- OCR supports detection and recognition of 84 languages.
- Supports multiple output formats, such as multimodal and NLP
  Markdown, JSON sorted by reading order, and rich intermediate
  formats.
- Supports various visualization results, including layout
  visualization and span visualization, for efficient confirmation of
  output quality.
- Supports both CPU and GPU environments.
- Compatible with Windows, Linux, and Mac platforms.

.. tip::

   Get started with MinerU by trying the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ or :doc:`installing it locally <user_guide/install/install>`.

User Guide
-------------
.. toctree::
   :maxdepth: 2
   :caption: User Guide

   user_guide

API Reference
-------------
If you are looking for information on a specific function, class or
method, this part of the documentation is for you.
.. toctree::
   :maxdepth: 2
   :caption: API

   api

Additional Notes
------------------
.. toctree::
   :maxdepth: 1
   :caption: Additional Notes

   additional_notes/known_issues
   additional_notes/faq
   additional_notes/glossary

@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd