MinerU · Commit 8b119e22 (Unverified)

Authored Nov 01, 2024 by Xiaomeng Zhao; committed by GitHub on Nov 01, 2024

Merge pull request #833 from icecraft/feat/tune_docs

Feat/tune docs

Parents: 099f19f2, 065bf993

Showing 14 changed files with 983 additions and 11 deletions (+983, -11)
next_docs/en/user_guide/install/download_model_weight_files.rst (+48, -0)
next_docs/en/user_guide/install/install.rst (+107, -0)
next_docs/en/user_guide/quick_start.rst (+13, -0)
next_docs/en/user_guide/quick_start/command_line.rst (+59, -0)
next_docs/en/user_guide/quick_start/extract_text.rst (+10, -0)
next_docs/en/user_guide/quick_start/to_markdown.rst (+52, -0)
next_docs/en/user_guide/tutorial.rst (+10, -0)
next_docs/en/user_guide/tutorial/output_file_description.rst (+416, -0)
next_docs/requirements.txt (+5, -4)
next_docs/zh_cn/.readthedocs.yaml (+2, -2)
scripts/download_models.py (+59, -0)
scripts/download_models_hf.py (+66, -0)
tests/test_data/data_reader_writer/test_multi_bucket_s3.py (+80, -2)
tests/test_data/data_reader_writer/test_s3.py (+56, -3)
next_docs/en/user_guide/install/download_model_weight_files.rst (new file, mode 100644)
Download Model Weight Files
==============================

Model downloads are divided into initial downloads and updates to the model directory. Please refer to the corresponding documentation for instructions on how to proceed.

Initial download of model files
------------------------------

1. Download the Model from Hugging Face
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use a Python script to download the model files from Hugging Face:

.. code:: bash

   pip install huggingface_hub
   wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
   python download_models_hf.py

The Python script will automatically download the model files and configure the model directory in the configuration file. The configuration file can be found in the user directory, with the filename ``magic-pdf.json``.
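After the script finishes, you can sanity-check the generated configuration. A minimal sketch (not part of MinerU; it only reads the ``models-dir`` and ``layoutreader-model-dir`` keys that the download script writes):

```python
import json
import os

# Locate the magic-pdf.json written by download_models_hf.py;
# it lives in the user's home directory.
config_path = os.path.join(os.path.expanduser('~'), 'magic-pdf.json')

if os.path.exists(config_path):
    with open(config_path, encoding='utf-8') as f:
        config = json.load(f)
    # These are the keys the download script fills in.
    print('models-dir:', config.get('models-dir'))
    print('layoutreader-model-dir:', config.get('layoutreader-model-dir'))
else:
    print(f'configuration not found at {config_path}')
```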
How to update models previously downloaded
-----------------------------------------

1. Models downloaded via Git LFS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended.

If you previously downloaded model files via git lfs, you can navigate to the previous download directory and use the ``git pull`` command to update the model.

2. Models downloaded via Hugging Face or ModelScope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you previously downloaded models via Hugging Face or ModelScope, you can rerun the Python script used for the initial download. This will automatically update the model directory to the latest version.
next_docs/en/user_guide/install/install.rst (new file, mode 100644)
Install
===============================================================
If you encounter any installation issues, please first consult the FAQ. If the parsing results are not as expected, refer to the Known Issues.

There are three different ways to experience MinerU.
Pre-installation Notice—Hardware and Software Environment Support
------------------------------------------------------------------
To ensure the stability and reliability of the project, we only optimize
and test for specific hardware and software environments during
development. This ensures that users deploying and running the project
on recommended system configurations will get the best performance with
the fewest compatibility issues.
By focusing resources on the mainline environment, our team can more
efficiently resolve potential bugs and develop new features.
In non-mainline environments, due to the diversity of hardware and
software configurations, as well as third-party dependency compatibility
issues, we cannot guarantee 100% project availability. Therefore, for
users who wish to use this project in non-recommended environments, we
suggest carefully reading the documentation and FAQ first. Most issues
already have corresponding solutions in the FAQ. We also encourage
community feedback to help us gradually expand support.
.. raw:: html

   <style>
   table, th, td {
       border: 1px solid black;
       border-collapse: collapse;
   }
   </style>

   <table>
       <tr>
           <td colspan="3" rowspan="2">Operating System</td>
       </tr>
       <tr>
           <td>Ubuntu 22.04 LTS</td>
           <td>Windows 10 / 11</td>
           <td>macOS 11+</td>
       </tr>
       <tr>
           <td colspan="3">CPU</td>
           <td>x86_64</td>
           <td>x86_64</td>
           <td>x86_64 / arm64</td>
       </tr>
       <tr>
           <td colspan="3">Memory</td>
           <td colspan="3">16GB or more, recommended 32GB+</td>
       </tr>
       <tr>
           <td colspan="3">Python Version</td>
           <td colspan="3">3.10</td>
       </tr>
       <tr>
           <td colspan="3">Nvidia Driver Version</td>
           <td>latest (Proprietary Driver)</td>
           <td>latest</td>
           <td>None</td>
       </tr>
       <tr>
           <td colspan="3">CUDA Environment</td>
           <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
           <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
           <td>None</td>
       </tr>
       <tr>
           <td rowspan="2">GPU Hardware Support List</td>
           <td colspan="2">Minimum Requirement 8G+ VRAM</td>
           <td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br>
           8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
           <td rowspan="2">None</td>
       </tr>
       <tr>
           <td colspan="2">Recommended Configuration 16G+ VRAM</td>
           <td colspan="2">3090/3090ti/4070ti super/4080/4090<br>
           16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
           </td>
       </tr>
   </table>
Create an environment
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   conda create -n MinerU python=3.10
   conda activate MinerU
   pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com

Download model weight files
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   pip install huggingface_hub
   wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
   python download_models_hf.py

MinerU is now installed. Check out :doc:`../quick_start`, or read :doc:`boost_with_cuda` to accelerate inference.
next_docs/en/user_guide/quick_start.rst (new file, mode 100644)
Quick Start
==============

Eager to get started? This page gives a good introduction to MinerU. Follow Installation to set up a project and install MinerU first.

.. toctree::
   :maxdepth: 1

   quick_start/command_line
   quick_start/to_markdown
next_docs/en/user_guide/quick_start/command_line.rst (new file, mode 100644)
Command Line
===================

.. code:: bash

   magic-pdf --help
   Usage: magic-pdf [OPTIONS]

   Options:
     -v, --version                display the version and exit
     -p, --path PATH              local pdf filepath or directory  [required]
     -o, --output-dir PATH        output local directory  [required]
     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
                                  technique to extract information from pdf. txt:
                                  suitable for the text-based pdf only and
                                  outperform ocr. auto: automatically choose the
                                  best method for parsing pdf from ocr and txt.
                                  without method specified, auto will be used by
                                  default.
     -l, --lang TEXT              Input the languages in the pdf (if known) to
                                  improve OCR accuracy. Optional. You should
                                  input "Abbreviation" with language form url: ht
                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
                                  /blog/multi_languages.html#5-support-languages-
                                  and-abbreviations
     -d, --debug BOOLEAN          Enables detailed debugging information during
                                  the execution of the CLI commands.
     -s, --start INTEGER          The starting page for PDF parsing, beginning
                                  from 0.
     -e, --end INTEGER            The ending page for PDF parsing, beginning from
                                  0.
     --help                       Show this message and exit.

   ## show version
   magic-pdf -v

   ## command line example
   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

``{some_pdf}`` can be a single PDF file or a directory containing multiple PDFs. The results will be saved in the ``{some_output_dir}`` directory. The output file list is as follows:

.. code:: text

   ├── some_pdf.md                    # markdown file
   ├── images                         # directory for storing images
   ├── some_pdf_layout.pdf            # layout diagram
   ├── some_pdf_middle.json           # MinerU intermediate processing result
   ├── some_pdf_model.json            # model inference result
   ├── some_pdf_origin.pdf            # original PDF file
   ├── some_pdf_spans.pdf             # smallest granularity bbox position information diagram
   └── some_pdf_content_list.json     # Rich text JSON arranged in reading order

For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
next_docs/en/user_guide/quick_start/extract_text.rst (new file, mode 100644)
Extract Content from Pdf
========================

.. code:: python

   from magic_pdf.data.read_api import read_local_pdfs
   from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
   from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
next_docs/en/user_guide/quick_start/to_markdown.rst (new file, mode 100644)
Convert To Markdown
========================

.. code:: python

   import os

   from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
   from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
   from magic_pdf.pipe.OCRPipe import OCRPipe


   ## args
   model_list = []
   pdf_file_name = "abc.pdf"  # replace with the real pdf path

   ## prepare env
   local_image_dir, local_md_dir = "output/images", "output"
   os.makedirs(local_image_dir, exist_ok=True)

   image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
       local_md_dir
   )  # create the image and markdown writers
   image_dir = str(os.path.basename(local_image_dir))
   reader1 = FileBasedDataReader("")
   pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

   pipe = OCRPipe(pdf_bytes, model_list, image_writer)

   pipe.pipe_classify()
   pipe.pipe_analyze()
   pipe.pipe_parse()

   pdf_info = pipe.pdf_mid_data["pdf_info"]
   md_content = pipe.pipe_mk_markdown(
       image_dir, drop_mode=DropMode.NONE, md_make_mode=MakeMode.MM_MD
   )

   if isinstance(md_content, list):
       md_writer.write_string(f"{pdf_file_name}.md", "\n".join(md_content))
   else:
       md_writer.write_string(f"{pdf_file_name}.md", md_content)

Check :doc:`../data/data_reader_writer` for more [reader | writer] examples
next_docs/en/user_guide/tutorial.rst (new file, mode 100644)
Tutorial
===========

Shows, from beginning to end, how to use MinerU via a minimal project.

.. toctree::
   :maxdepth: 1

   tutorial/output_file_description
next_docs/en/user_guide/tutorial/output_file_description.rst (new file, mode 100644)
Output File Description
=========================
After executing the ``magic-pdf`` command, in addition to outputting
files related to markdown, several other files unrelated to markdown
will also be generated. These files will be introduced one by one.
some_pdf_layout.pdf
~~~~~~~~~~~~~~~~~~~
Each page layout consists of one or more boxes. The number at the top
left of each box indicates its sequence number. Additionally, in
``layout.pdf``, different content blocks are highlighted with different
background colors.
.. figure:: ../../_static/image/layout_example.png
   :alt: layout example

   layout example
some_pdf_spans.pdf
~~~~~~~~~~~~~~~~~~
All spans on the page are drawn with different colored line frames
according to the span type. This file can be used for quality control,
allowing for quick identification of issues such as missing text or
unrecognized inline formulas.
.. figure:: ../../_static/image/spans_example.png
   :alt: spans example

   spans example
some_pdf_model.json
~~~~~~~~~~~~~~~~~~~
Structure Definition
^^^^^^^^^^^^^^^^^^^^
.. code:: python

   from pydantic import BaseModel, Field
   from enum import IntEnum


   class CategoryType(IntEnum):
       title = 0            # Title
       plain_text = 1       # Text
       abandon = 2          # Includes headers, footers, page numbers, and page annotations
       figure = 3           # Image
       figure_caption = 4   # Image description
       table = 5            # Table
       table_caption = 6    # Table description
       table_footnote = 7   # Table footnote
       isolate_formula = 8  # Block formula
       formula_caption = 9  # Formula label
       embedding = 13       # Inline formula
       isolated = 14        # Block formula
       text = 15            # OCR recognition result


   class PageInfo(BaseModel):
       page_no: int = Field(description="Page number, the first page is 0", ge=0)
       height: int = Field(description="Page height", gt=0)
       width: int = Field(description="Page width", ge=0)


   class ObjectInferenceResult(BaseModel):
       category_id: CategoryType = Field(description="Category", ge=0)
       poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
       score: float = Field(description="Confidence of the inference result")
       latex: str | None = Field(description="LaTeX parsing result", default=None)
       html: str | None = Field(description="HTML parsing result", default=None)


   class PageInferenceResults(BaseModel):
       layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
       page_info: PageInfo = Field(description="Page metadata")


   # The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
   inference_result: list[PageInferenceResults] = []
The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3],
representing the coordinates of the top-left, top-right, bottom-right,
and bottom-left points respectively. |Poly Coordinate Diagram|
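Since ``poly`` always lists the four corners in that fixed order, an axis-aligned bounding box can be recovered from it. A small helper sketch (``poly_to_bbox`` is illustrative, not a MinerU API):

```python
def poly_to_bbox(poly):
    """Collapse [x0, y0, x1, y1, x2, y2, x3, y3] quadrilateral
    coordinates into an axis-aligned [xmin, ymin, xmax, ymax] box."""
    xs = poly[0::2]  # x coordinates of the four corners
    ys = poly[1::2]  # y coordinates of the four corners
    return [min(xs), min(ys), max(xs), max(ys)]

# Corner order: top-left, top-right, bottom-right, bottom-left
print(poly_to_bbox([99.19, 100.31, 730.37, 100.31,
                    730.37, 245.81, 99.19, 245.81]))
# → [99.19, 100.31, 730.37, 245.81]
```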
example
^^^^^^^
.. code:: json
   [
     {
       "layout_dets": [
         {
           "category_id": 2,
           "poly": [
             99.1906967163086,
             100.3119125366211,
             730.3707885742188,
             100.3119125366211,
             730.3707885742188,
             245.81326293945312,
             99.1906967163086,
             245.81326293945312
           ],
           "score": 0.9999997615814209
         }
       ],
       "page_info": {
         "page_no": 0,
         "height": 2339,
         "width": 1654
       }
     },
     {
       "layout_dets": [
         {
           "category_id": 5,
           "poly": [
             99.13092803955078,
             2210.680419921875,
             497.3183898925781,
             2210.680419921875,
             497.3183898925781,
             2264.78076171875,
             99.13092803955078,
             2264.78076171875
           ],
           "score": 0.9999997019767761
         }
       ],
       "page_info": {
         "page_no": 1,
         "height": 2339,
         "width": 1654
       }
     }
   ]
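To get a quick feel for a ``model.json`` file, you can tally how often each category from the enum above was detected. A sketch (``count_categories`` is a hypothetical helper, not part of MinerU):

```python
from collections import Counter

def count_categories(pages):
    """Tally category_id values across all pages of a model.json payload."""
    return Counter(det['category_id']
                   for page in pages
                   for det in page['layout_dets'])

# For a real run: pages = json.load(open('some_pdf_model.json', encoding='utf-8'))
pages = [
    {'layout_dets': [{'category_id': 2, 'poly': [], 'score': 1.0}],
     'page_info': {'page_no': 0, 'height': 2339, 'width': 1654}},
    {'layout_dets': [{'category_id': 5, 'poly': [], 'score': 1.0}],
     'page_info': {'page_no': 1, 'height': 2339, 'width': 1654}},
]
print(count_categories(pages))  # one abandon (2) and one table (5) detection
```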
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~
+----------------+--------------------------------------------------------------+
| Field Name     | Description                                                  |
+================+==============================================================+
| pdf_info       | list, each element is a dict representing the parsing result |
|                | of each PDF page, see the table below for details            |
+----------------+--------------------------------------------------------------+
| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
|                | intermediate parsing state                                   |
+----------------+--------------------------------------------------------------+
| \_version_name | string, indicates the version of magic-pdf used in this      |
|                | parsing                                                      |
+----------------+--------------------------------------------------------------+
**pdf_info**
Field structure description
+---------------------+-----------------------------------------------------------+
| Field Name          | Description                                               |
+=====================+===========================================================+
| preproc_blocks      | Intermediate result after PDF preprocessing, not yet      |
|                     | segmented                                                 |
+---------------------+-----------------------------------------------------------+
| layout_bboxes       | Layout segmentation results, containing layout direction  |
|                     | (vertical, horizontal), and bbox, sorted by reading order |
+---------------------+-----------------------------------------------------------+
| page_idx            | Page number, starting from 0                              |
+---------------------+-----------------------------------------------------------+
| page_size           | Page width and height                                     |
+---------------------+-----------------------------------------------------------+
| \_layout_tree       | Layout tree structure                                     |
+---------------------+-----------------------------------------------------------+
| images              | list, each element is a dict representing an img_block    |
+---------------------+-----------------------------------------------------------+
| tables              | list, each element is a dict representing a table_block   |
+---------------------+-----------------------------------------------------------+
| interline_equations | list, each element is a dict representing an              |
|                     | interline_equation_block                                  |
+---------------------+-----------------------------------------------------------+
| discarded_blocks    | List, block information returned by the model that needs  |
|                     | to be dropped                                             |
+---------------------+-----------------------------------------------------------+
| para_blocks         | Result after segmenting preproc_blocks                    |
+---------------------+-----------------------------------------------------------+
In the above table, ``para_blocks`` is an array of dicts, each dict
representing a block structure. A block can support up to one level of
nesting.
**block**
The outer block is referred to as a first-level block, and the fields in
the first-level block include:
+------------+----------------------------------------------------------------+
| Field Name | Description                                                    |
+============+================================================================+
| type       | Block type (table \| image)                                    |
+------------+----------------------------------------------------------------+
| bbox       | Block bounding box coordinates                                 |
+------------+----------------------------------------------------------------+
| blocks     | list, each element is a dict representing a second-level block |
+------------+----------------------------------------------------------------+
There are only two types of first-level blocks: “table” and “image”. All
other blocks are second-level blocks.
The fields in a second-level block include:
+------------+-----------------------------------------------------------+
| Field Name | Description                                               |
+============+===========================================================+
| type       | Block type                                                |
+------------+-----------------------------------------------------------+
| bbox       | Block bounding box coordinates                            |
+------------+-----------------------------------------------------------+
| lines      | list, each element is a dict representing a line, used to |
|            | describe the composition of a line of information         |
+------------+-----------------------------------------------------------+
Detailed explanation of second-level block types
================== ======================
type Description
================== ======================
image_body Main body of the image
image_caption Image description text
table_body Main body of the table
table_caption Table description text
table_footnote Table footnote
text Text block
title Title block
interline_equation Block formula
================== ======================
**line**
The field format of a line is as follows:
+------------+-----------------------------------------------------------+
| Field Name | Description                                               |
+============+===========================================================+
| bbox       | Bounding box coordinates of the line                      |
+------------+-----------------------------------------------------------+
| spans      | list, each element is a dict representing a span, used to |
|            | describe the composition of the smallest unit             |
+------------+-----------------------------------------------------------+
**span**
+---------------------+-----------------------------------------------------------+
| Field Name          | Description                                               |
+=====================+===========================================================+
| bbox                | Bounding box coordinates of the span                      |
+---------------------+-----------------------------------------------------------+
| type                | Type of the span                                          |
+---------------------+-----------------------------------------------------------+
| content \| img_path | Text spans use content, chart spans use img_path to store |
|                     | the actual text or screenshot path information            |
+---------------------+-----------------------------------------------------------+
The types of spans are as follows:
================== ==============
type Description
================== ==============
image Image
table Table
text Text
inline_equation Inline formula
interline_equation Block formula
================== ==============
**Summary**
A span is the smallest storage unit for all elements.
The elements stored within para_blocks are block information.
The block structure is as follows:
First-level block (if any) -> Second-level block -> Line -> Span
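That nesting can be walked mechanically. A traversal sketch (``iter_spans`` is illustrative, not a MinerU API); it treats any block without a nested ``blocks`` list as a second-level block:

```python
def iter_spans(para_blocks):
    """Yield every span by walking: first-level block (if any)
    -> second-level block -> line -> span."""
    for block in para_blocks:
        # Only table/image first-level blocks nest sub-blocks under 'blocks'.
        for sub in block.get('blocks', [block]):
            for line in sub.get('lines', []):
                yield from line.get('spans', [])

para_blocks = [{'type': 'text', 'bbox': [52, 61, 294, 82], 'lines': [
    {'bbox': [52, 61, 294, 72],
     'spans': [{'bbox': [54, 61, 296, 72], 'content': 'hello',
                'type': 'text', 'score': 1.0}]},
]}]
print([span['content'] for span in iter_spans(para_blocks)])  # ['hello']
```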
.. _example-1:
example
^^^^^^^
.. code:: json
   {
     "pdf_info": [
       {
         "preproc_blocks": [
           {
             "type": "text",
             "bbox": [
               52,
               61.956024169921875,
               294,
               82.99800872802734
             ],
             "lines": [
               {
                 "bbox": [
                   52,
                   61.956024169921875,
                   294,
                   72.0000228881836
                 ],
                 "spans": [
                   {
                     "bbox": [
                       54.0,
                       61.956024169921875,
                       296.2261657714844,
                       72.0000228881836
                     ],
                     "content": "dependent on the service headway and the reliability of the departure ",
                     "type": "text",
                     "score": 1.0
                   }
                 ]
               }
             ]
           }
         ],
         "layout_bboxes": [
           {
             "layout_bbox": [
               52,
               61,
               294,
               731
             ],
             "layout_label": "V",
             "sub_layout": []
           }
         ],
         "page_idx": 0,
         "page_size": [
           612.0,
           792.0
         ],
         "_layout_tree": [],
         "images": [],
         "tables": [],
         "interline_equations": [],
         "discarded_blocks": [],
         "para_blocks": [
           {
             "type": "text",
             "bbox": [
               52,
               61.956024169921875,
               294,
               82.99800872802734
             ],
             "lines": [
               {
                 "bbox": [
                   52,
                   61.956024169921875,
                   294,
                   72.0000228881836
                 ],
                 "spans": [
                   {
                     "bbox": [
                       54.0,
                       61.956024169921875,
                       296.2261657714844,
                       72.0000228881836
                     ],
                     "content": "dependent on the service headway and the reliability of the departure ",
                     "type": "text",
                     "score": 1.0
                   }
                 ]
               }
             ]
           }
         ]
       }
     ],
     "_parse_type": "txt",
     "_version_name": "0.6.1"
   }
.. |Poly Coordinate Diagram| image:: ../../_static/image/poly.png
next_docs/requirements.txt
@@ -5,7 +5,8 @@ Pillow==8.4.0
 pydantic>=2.7.2,<2.8.0
 PyMuPDF>=1.24.9
 sphinx
-sphinx-argparse
-sphinx-book-theme
-sphinx-copybutton
-sphinx_rtd_theme
+sphinx-argparse>=0.5.2
+sphinx-book-theme>=1.1.3
+sphinx-copybutton>=0.5.2
+sphinx_rtd_theme>=3.0.1
+autodoc_pydantic>=2.2.0
\ No newline at end of file
next_docs/zh_cn/.readthedocs.yaml
@@ -10,7 +10,7 @@ formats:
 python:
   install:
-    - requirements: docs/requirements.txt
+    - requirements: next_docs/requirements.txt
 sphinx:
-  configuration: docs/zh_cn/conf.py
+  configuration: next_docs/zh_cn/conf.py
scripts/download_models.py (new file, mode 100644)
import json
import os

import requests
from modelscope import snapshot_download


def download_json(url):
    # Download the JSON file
    response = requests.get(url)
    response.raise_for_status()  # Check whether the request succeeded
    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    if os.path.exists(local_filename):
        data = json.load(open(local_filename))
        config_version = data.get('config_version', '0.0.0')
        if config_version < '1.0.0':
            data = download_json(url)
    else:
        data = download_json(url)

    # Modify the content
    for key, value in modifications.items():
        data[key] = value

    # Save the modified content
    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    mineru_patterns = [
        "models/Layout/LayoutLMv3/*",
        "models/Layout/YOLO/*",
        "models/MFD/YOLO/*",
        "models/MFR/unimernet_small/*",
        "models/TabRec/TableMaster/*",
        "models/TabRec/StructEqTable/*",
    ]
    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
    layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader')
    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

    json_url = 'https://gitee.com/myhloli/MinerU/raw/dev/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)

    json_mods = {
        'models-dir': model_dir,
        'layoutreader-model-dir': layoutreader_model_dir,
    }

    download_and_modify_json(json_url, config_file, json_mods)
    print(f'The configuration file has been configured successfully, the path is: {config_file}')
scripts/download_models_hf.py (new file, mode 100644)
import json
import os

import requests
from huggingface_hub import snapshot_download


def download_json(url):
    # Download the JSON file
    response = requests.get(url)
    response.raise_for_status()  # Check whether the request succeeded
    return response.json()


def download_and_modify_json(url, local_filename, modifications):
    if os.path.exists(local_filename):
        data = json.load(open(local_filename))
        config_version = data.get('config_version', '0.0.0')
        if config_version < '1.0.0':
            data = download_json(url)
    else:
        data = download_json(url)

    # Modify the content
    for key, value in modifications.items():
        data[key] = value

    # Save the modified content
    with open(local_filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    mineru_patterns = [
        "models/Layout/LayoutLMv3/*",
        "models/Layout/YOLO/*",
        "models/MFD/YOLO/*",
        "models/MFR/unimernet_small/*",
        "models/TabRec/TableMaster/*",
        "models/TabRec/StructEqTable/*",
    ]
    model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)

    layoutreader_pattern = [
        "*.json",
        "*.safetensors",
    ]
    layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)

    model_dir = model_dir + '/models'
    print(f'model_dir is: {model_dir}')
    print(f'layoutreader_model_dir is: {layoutreader_model_dir}')

    json_url = 'https://github.com/opendatalab/MinerU/raw/dev/magic-pdf.template.json'
    config_file_name = 'magic-pdf.json'
    home_dir = os.path.expanduser('~')
    config_file = os.path.join(home_dir, config_file_name)

    json_mods = {
        'models-dir': model_dir,
        'layoutreader-model-dir': layoutreader_model_dir,
    }

    download_and_modify_json(json_url, config_file, json_mods)
    print(f'The configuration file has been configured successfully, the path is: {config_file}')
tests/test_data/data_reader_writer/test_multi_bucket_s3.py
@@ -41,8 +41,8 @@ def test_multi_bucket_s3_reader_writer():
         ),
     ]
-    reader = MultiBucketS3DataReader(default_bucket=bucket, s3_configs=s3configs)
-    writer = MultiBucketS3DataWriter(default_bucket=bucket, s3_configs=s3configs)
+    reader = MultiBucketS3DataReader(bucket, s3configs)
+    writer = MultiBucketS3DataWriter(bucket, s3configs)

     bits = reader.read('meta-index/scihub/v001/scihub/part-66210c190659-000026.jsonl')
@@ -80,3 +80,81 @@ def test_multi_bucket_s3_reader_writer():
     assert '123'.encode() == reader.read('unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt')
+
+
+@pytest.mark.skipif(
+    os.getenv('S3_ACCESS_KEY_2', None) is None, reason='need s3 config!'
+)
+def test_multi_bucket_s3_reader_writer_with_prefix():
+    """test multi bucket s3 reader writer must config s3 config in the
+    environment export S3_BUCKET=xxx export S3_ACCESS_KEY=xxx export
+    S3_SECRET_KEY=xxx export S3_ENDPOINT=xxx.
+
+    export S3_BUCKET_2=xxx export S3_ACCESS_KEY_2=xxx export S3_SECRET_KEY_2=xxx export S3_ENDPOINT_2=xxx
+    """
+    bucket = os.getenv('S3_BUCKET', '')
+    ak = os.getenv('S3_ACCESS_KEY', '')
+    sk = os.getenv('S3_SECRET_KEY', '')
+    endpoint_url = os.getenv('S3_ENDPOINT', '')
+
+    bucket_2 = os.getenv('S3_BUCKET_2', '')
+    ak_2 = os.getenv('S3_ACCESS_KEY_2', '')
+    sk_2 = os.getenv('S3_SECRET_KEY_2', '')
+    endpoint_url_2 = os.getenv('S3_ENDPOINT_2', '')
+
+    s3configs = [
+        S3Config(bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url),
+        S3Config(
+            bucket_name=bucket_2,
+            access_key=ak_2,
+            secret_key=sk_2,
+            endpoint_url=endpoint_url_2,
+        ),
+    ]
+    prefix = 'meta-index'
+    reader = MultiBucketS3DataReader(f'{bucket}/{prefix}', s3configs)
+    writer = MultiBucketS3DataWriter(f'{bucket}/{prefix}', s3configs)
+
+    bits = reader.read('scihub/v001/scihub/part-66210c190659-000026.jsonl')
+    assert bits == reader.read(
+        f's3://{bucket}/{prefix}/scihub/v001/scihub/part-66210c190659-000026.jsonl'
+    )
+
+    bits = reader.read(
+        f's3://{bucket_2}/enbook-scimag/78800000/libgen.scimag78872000-78872999/10.1017/cbo9780511770425.012.pdf'
+    )
+    docs = fitz.open('pdf', bits)
+    assert len(docs) == 10
+
+    bits = reader.read('scihub/v001/scihub/part-66210c190659-000026.jsonl?bytes=566,713')
+    assert bits == reader.read_at('scihub/v001/scihub/part-66210c190659-000026.jsonl', 566, 713)
+    assert len(json.loads(bits)) > 0
+
+    writer.write_string('unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt', 'abc')
+    assert 'abc'.encode() == reader.read('unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt')
+    assert 'abc'.encode() == reader.read(
+        f's3://{bucket}/{prefix}/unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt'
+    )
+
+    writer.write(
+        'unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt',
+        '123'.encode(),
+    )
+    assert '123'.encode() == reader.read('unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt')
tests/test_data/data_reader_writer/test_s3.py
@@ -9,7 +9,7 @@ from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
 @pytest.mark.skipif(
     os.getenv('S3_ACCESS_KEY', None) is None, reason='need s3 config!'
 )
-def test_multi_bucket_s3_reader_writer():
+def test_s3_reader_writer():
     """test multi bucket s3 reader writer must config s3 config in the
     environment export S3_BUCKET=xxx export S3_ACCESS_KEY=xxx export
     S3_SECRET_KEY=xxx export S3_ENDPOINT=xxx."""
@@ -18,8 +18,8 @@ def test_multi_bucket_s3_reader_writer():
     sk = os.getenv('S3_SECRET_KEY', '')
     endpoint_url = os.getenv('S3_ENDPOINT', '')
-    reader = S3DataReader(bucket=bucket, ak=ak, sk=sk, endpoint_url=endpoint_url)
-    writer = S3DataWriter(bucket=bucket, ak=ak, sk=sk, endpoint_url=endpoint_url)
+    reader = S3DataReader('', bucket, ak, sk, endpoint_url)
+    writer = S3DataWriter('', bucket, ak, sk, endpoint_url)

     bits = reader.read('meta-index/scihub/v001/scihub/part-66210c190659-000026.jsonl')
@@ -51,3 +51,56 @@ def test_multi_bucket_s3_reader_writer():
     assert '123'.encode() == reader.read('unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt')
+
+
+@pytest.mark.skipif(
+    os.getenv('S3_ACCESS_KEY', None) is None, reason='need s3 config!'
+)
+def test_s3_reader_writer_with_prefix():
+    """test multi bucket s3 reader writer must config s3 config in the
+    environment export S3_BUCKET=xxx export S3_ACCESS_KEY=xxx export
+    S3_SECRET_KEY=xxx export S3_ENDPOINT=xxx."""
+    bucket = os.getenv('S3_BUCKET', '')
+    ak = os.getenv('S3_ACCESS_KEY', '')
+    sk = os.getenv('S3_SECRET_KEY', '')
+    endpoint_url = os.getenv('S3_ENDPOINT', '')
+    prefix = 'meta-index'
+
+    reader = S3DataReader(prefix, bucket, ak, sk, endpoint_url)
+    writer = S3DataWriter(prefix, bucket, ak, sk, endpoint_url)
+
+    bits = reader.read('scihub/v001/scihub/part-66210c190659-000026.jsonl')
+    assert bits == reader.read(
+        f's3://{bucket}/{prefix}/scihub/v001/scihub/part-66210c190659-000026.jsonl'
+    )
+
+    bits = reader.read('scihub/v001/scihub/part-66210c190659-000026.jsonl?bytes=566,713')
+    assert bits == reader.read_at('scihub/v001/scihub/part-66210c190659-000026.jsonl', 566, 713)
+    assert len(json.loads(bits)) > 0
+
+    writer.write_string('unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt', 'abc')
+    assert 'abc'.encode() == reader.read('unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt')
+    assert 'abc'.encode() == reader.read(
+        f's3://{bucket}/{prefix}/unittest/data/data_reader_writer/multi_bucket_s3_data/test01.txt'
+    )
+
+    writer.write(
+        f'{bucket}/{prefix}/unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt',
+        '123'.encode(),
+    )
+    assert '123'.encode() == reader.read('unittest/data/data_reader_writer/multi_bucket_s3_data/test02.txt')