"python/sglang/srt/managers/controller/tp_worker.py" did not exist on "64fe311593edee917a28506be8723127d4e938c9"
Unverified Commit bcbbee8c authored by Xiaomeng Zhao's avatar Xiaomeng Zhao Committed by GitHub
Browse files

Merge pull request #2622 from myhloli/dev

Dev
parents 3cc3f754 ced5a7b4
.. toctree::
:maxdepth: 2
user_guide/install
user_guide/usage
user_guide/quick_start
user_guide/tutorial
user_guide/data
user_guide/inference_result
user_guide/pipe_result
Data
=========
.. toctree::
:maxdepth: 2
data/dataset
data/read_api
data/data_reader_writer
data/io
Data Reader Writer
====================
This module reads and writes bytes from different media. You can implement new classes to meet the needs of your own scenarios
if MinerU does not provide a suitable class. Implementing a new class is easy; the only requirement is to inherit from
``DataReader`` or ``DataWriter``.
.. code:: python
class SomeReader(DataReader):
def read(self, path: str) -> bytes:
pass
def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
pass
class SomeWriter(DataWriter):
def write(self, path: str, data: bytes) -> None:
pass
def write_string(self, path: str, data: str) -> None:
pass
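For instance, a minimal in-memory implementation might look like the sketch below. The dictionary-backed ``DictReader``/``DictWriter`` classes are purely illustrative, and the import path of the base classes is assumed to be ``magic_pdf.data.data_reader_writer``.

.. code:: python

    from magic_pdf.data.data_reader_writer import DataReader, DataWriter  # import path assumed

    class DictReader(DataReader):
        """Illustrative reader backed by an in-memory dict."""

        def __init__(self, storage: dict[str, bytes]):
            self._storage = storage

        def read(self, path: str) -> bytes:
            # return the whole object stored under `path`
            return self._storage[path]

        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
            # return `limit` bytes starting at `offset`; -1 means "until the end"
            data = self._storage[path]
            return data[offset:] if limit == -1 else data[offset:offset + limit]

    class DictWriter(DataWriter):
        """Illustrative writer backed by the same in-memory dict."""

        def __init__(self, storage: dict[str, bytes]):
            self._storage = storage

        def write(self, path: str, data: bytes) -> None:
            # store the bytes under `path`
            self._storage[path] = data

A pair like this can be handy in unit tests where touching the real filesystem or S3 is undesirable.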
Readers may be curious about the difference between :doc:`io` and this section, since the two look very similar at first glance.
:doc:`io` provides the fundamental functions, while this section works at the application level: users can build their own classes to meet
the needs of their applications, and those classes may share the same underlying IO functions. That is why :doc:`io` exists as a separate layer.
Important Classes
-----------------
.. code:: python
class FileBasedDataReader(DataReader):
def __init__(self, parent_dir: str = ''):
pass
class FileBasedDataWriter(DataWriter):
def __init__(self, parent_dir: str = '') -> None:
pass
Class ``FileBasedDataReader`` is initialized with a single parameter, ``parent_dir``. This means that every method provided by ``FileBasedDataReader`` has the following behavior.
Features:
#. When reading from an absolute path, ``parent_dir`` is ignored and the content is read from that path.
#. When reading from a relative path, the path is first joined with ``parent_dir``, and the content is then read from the merged path.
.. note::
``FileBasedDataWriter`` shares the same behavior as ``FileBasedDataReader``.
.. code:: python
class MultiS3Mixin:
def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
pass
class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
pass
All read-related methods provided by class ``MultiBucketS3DataReader`` have the following behavior.
Features:
#. When reading an object via a full S3 path, for example ``s3://test_bucket/test_object``, ``default_prefix`` is ignored.
#. When reading an object via a relative path, the path is first joined with ``default_prefix`` (with ``bucket_name`` trimmed), and the content is then read from the merged path. ``bucket_name`` is the first element obtained after splitting ``default_prefix`` with the delimiter ``/``.
.. note::
``MultiBucketS3DataWriter`` shares the same behavior as ``MultiBucketS3DataReader``.
.. code:: python
class S3DataReader(MultiBucketS3DataReader):
pass
``S3DataReader`` is built on top of ``MultiBucketS3DataReader`` but supports only a single bucket. The same applies to ``S3DataWriter``.
Read Examples
---------------
.. code:: python
import os
from magic_pdf.data.data_reader_writer import *
from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
from magic_pdf.data.schemas import S3Config
# file based related
file_based_reader1 = FileBasedDataReader('')
## will read file abc
file_based_reader1.read('abc')
file_based_reader2 = FileBasedDataReader('/tmp')
## will read /tmp/abc
file_based_reader2.read('abc')
## will read /tmp/logs/message.txt
file_based_reader2.read('/tmp/logs/message.txt')
# multi-bucket S3 related
bucket = "bucket" # replace with real bucket
ak = "ak" # replace with real access key
sk = "sk" # replace with real secret key
endpoint_url = "endpoint_url" # replace with real endpoint_url
bucket_2 = "bucket_2" # replace with real bucket
ak_2 = "ak_2" # replace with real access key
sk_2 = "sk_2" # replace with real secret key
endpoint_url_2 = "endpoint_url_2" # replace with real endpoint_url
test_prefix = 'test/unittest'
multi_bucket_s3_reader1 = MultiBucketS3DataReader(f"{bucket}/{test_prefix}", [S3Config(
bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
),
S3Config(
bucket_name=bucket_2,
access_key=ak_2,
secret_key=sk_2,
endpoint_url=endpoint_url_2,
)])
## will read s3://{bucket}/{test_prefix}/abc
multi_bucket_s3_reader1.read('abc')
## will read s3://{bucket}/{test_prefix}/efg
multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')
## will read s3://{bucket_2}/{test_prefix}/abc
multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')
# s3 related
s3_reader1 = S3DataReader(
test_prefix,
bucket,
ak,
sk,
endpoint_url
)
## will read s3://{bucket}/{test_prefix}/abc
s3_reader1.read('abc')
## will read s3://{bucket}/efg
s3_reader1.read(f's3://{bucket}/efg')
Write Examples
---------------
.. code:: python
import os
from magic_pdf.data.data_reader_writer import *
from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
from magic_pdf.data.schemas import S3Config
# file based related
file_based_writer1 = FileBasedDataWriter("")
## will write 123 to abc
file_based_writer1.write("abc", "123".encode())
## will write 123 to abc
file_based_writer1.write_string("abc", "123")
file_based_writer2 = FileBasedDataWriter("/tmp")
## will write 123 to /tmp/abc
file_based_writer2.write_string("abc", "123")
## will write 123 to /tmp/logs/message.txt
file_based_writer2.write_string("/tmp/logs/message.txt", "123")
# multi-bucket S3 related
bucket = "bucket" # replace with real bucket
ak = "ak" # replace with real access key
sk = "sk" # replace with real secret key
endpoint_url = "endpoint_url" # replace with real endpoint_url
bucket_2 = "bucket_2" # replace with real bucket
ak_2 = "ak_2" # replace with real access key
sk_2 = "sk_2" # replace with real secret key
endpoint_url_2 = "endpoint_url_2" # replace with real endpoint_url
test_prefix = "test/unittest"
multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
f"{bucket}/{test_prefix}",
[
S3Config(
bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
),
S3Config(
bucket_name=bucket_2,
access_key=ak_2,
secret_key=sk_2,
endpoint_url=endpoint_url_2,
),
],
)
## will write 123 to s3://{bucket}/{test_prefix}/abc
multi_bucket_s3_writer1.write_string("abc", "123")
## will write 123 to s3://{bucket}/{test_prefix}/abc
multi_bucket_s3_writer1.write("abc", "123".encode())
## will write 123 to s3://{bucket}/{test_prefix}/efg
multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())
## will write 123 to s3://{bucket_2}/{test_prefix}/abc
multi_bucket_s3_writer1.write(f's3://{bucket_2}/{test_prefix}/abc', '123'.encode())
# s3 related
s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)
## will write 123 to s3://{bucket}/{test_prefix}/abc
s3_writer1.write("abc", "123".encode())
## will write 123 to s3://{bucket}/{test_prefix}/abc
s3_writer1.write_string("abc", "123")
## will write 123 to s3://{bucket}/efg
s3_writer1.write(f"s3://{bucket}/efg", "123".encode())
Check :doc:`../../api/data_reader_writer` for more details
Dataset
===========
Import Classes
-----------------
Dataset
^^^^^^^^
Each PDF or set of images forms one ``Dataset``. PDFs fall into two categories, :ref:`digital_method_section` and :ref:`ocr_method_section`.
Images produce an ``ImageDataset``, a subclass of ``Dataset``, while PDF files produce a ``PymuDocDataset``.
The difference between ``ImageDataset`` and ``PymuDocDataset`` is that ``ImageDataset`` only supports the ``OCR`` parse method,
while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
.. note::
In fact, some PDFs are generated from images, which means they cannot support the ``TXT`` method. Currently, it is up to the user to ensure this does not happen.
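A minimal construction sketch, assuming ``PymuDocDataset`` and ``ImageDataset`` accept the raw bytes of the source file (see :doc:`../../api/dataset` for the authoritative signatures):

.. code:: python

    from magic_pdf.data.dataset import ImageDataset, PymuDocDataset

    # a PymuDocDataset built from the raw bytes of a PDF file
    with open("some.pdf", "rb") as f:  # replace with a real pdf file
        pdf_ds = PymuDocDataset(f.read())

    # an ImageDataset built from the raw bytes of an image file
    with open("some.png", "rb") as f:  # replace with a real image file
        img_ds = ImageDataset(f.read())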
PDF Parse Methods
------------------
.. _ocr_method_section:
OCR
^^^^
Extracts characters via ``Optical Character Recognition`` (OCR) technology.
.. _digital_method_section:
TXT
^^^^^^^^
Extracts characters via a third-party library; currently we use ``pymupdf``.
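In practice you rarely pick the parse method by hand; the dataset can classify itself, as in this short sketch (mirroring the usage shown in the quick-start examples):

.. code:: python

    from magic_pdf.config.enums import SupportedPdfParseMethod

    # ds is a PymuDocDataset built as in the Dataset section above
    if ds.classify() == SupportedPdfParseMethod.OCR:
        print("use the OCR parse method")
    else:
        print("use the TXT parse method")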
Check :doc:`../../api/dataset` for more details
IO
===
This module reads and writes bytes from different media. Currently we provide ``S3Reader`` and ``S3Writer`` for AWS S3-compatible media,
and ``HttpReader`` and ``HttpWriter`` for remote HTTP files. You can implement new classes to meet the needs of your own scenarios
if MinerU does not provide a suitable class. Implementing a new class is easy; the only requirement is to inherit from
``IOReader`` or ``IOWriter``.
.. code:: python
class SomeReader(IOReader):
def read(self, path: str) -> bytes:
pass
def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
pass
class SomeWriter(IOWriter):
def write(self, path: str, data: bytes) -> None:
pass
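For example, a local-filesystem implementation might look like the following sketch. The import path of ``IOReader``/``IOWriter`` is assumed here to be ``magic_pdf.data.io.base``; adjust it to wherever the base classes live in your installation.

.. code:: python

    from pathlib import Path

    from magic_pdf.data.io.base import IOReader, IOWriter  # import path assumed

    class LocalReader(IOReader):
        """Illustrative reader that serves bytes from the local filesystem."""

        def read(self, path: str) -> bytes:
            return Path(path).read_bytes()

        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
            # read `limit` bytes starting at `offset`; -1 means "until the end"
            with open(path, "rb") as f:
                f.seek(offset)
                return f.read() if limit == -1 else f.read(limit)

    class LocalWriter(IOWriter):
        """Illustrative writer that stores bytes on the local filesystem."""

        def write(self, path: str, data: bytes) -> None:
            Path(path).write_bytes(data)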
Check :doc:`../../api/io` for more details
read_api
==========
Reads content from a file or directory to create a ``Dataset``. Currently we provide several functions that cover common scenarios.
If you have a new scenario that is common to most users, you can post it on the official GitHub issues with a detailed description.
It is also easy to implement your own read-related functions.
Important Functions
-------------------
read_jsonl
^^^^^^^^^^^^^^^^
Reads content from a JSONL file, which may be located on the local machine or on remote S3. If you want to know more about JSONL, please go to :doc:`../../additional_notes/glossary`.
.. code:: python
from magic_pdf.data.read_api import *
from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
from magic_pdf.data.schemas import S3Config
# read jsonl from local machine
datasets = read_jsonl("tt.jsonl", None) # replace with real jsonl file
# read jsonl from remote s3
bucket = "bucket_1" # replace with real s3 bucket
ak = "access_key_1" # replace with real s3 access key
sk = "secret_key_1" # replace with real s3 secret key
endpoint_url = "endpoint_url_1" # replace with real s3 endpoint url
bucket_2 = "bucket_2" # replace with real s3 bucket
ak_2 = "access_key_2" # replace with real s3 access key
sk_2 = "secret_key_2" # replace with real s3 secret key
endpoint_url_2 = "endpoint_url_2" # replace with real s3 endpoint url
s3configs = [
S3Config(
bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
),
S3Config(
bucket_name=bucket_2,
access_key=ak_2,
secret_key=sk_2,
endpoint_url=endpoint_url_2,
),
]
s3_reader = MultiBucketS3DataReader(bucket, s3configs)
datasets = read_jsonl(f"s3://bucket_1/tt.jsonl", s3_reader) # replace with real s3 jsonl file
read_local_pdfs
^^^^^^^^^^^^^^^^^
Read PDFs from a file path or directory.
.. code:: python
from magic_pdf.data.read_api import *
# read pdf path
datasets = read_local_pdfs("tt.pdf")
# read pdfs under directory
datasets = read_local_pdfs("pdfs/")
read_local_images
^^^^^^^^^^^^^^^^^^^
Read images from a file path or directory.
.. code:: python
from magic_pdf.data.read_api import *
# read from image path
datasets = read_local_images("tt.png") # replace with real file path
# read files under the directory whose suffix is in the suffixes list
datasets = read_local_images("images/", suffixes=[".png", ".jpg"]) # replace with real directory
read_local_office
^^^^^^^^^^^^^^^^^^^^
Read MS Office files from a file path or directory.
.. code:: python
from magic_pdf.data.read_api import *
# read from an MS Office file path
datasets = read_local_office("tt.doc") # replace with real file path
# read MS Office files under the directory
datasets = read_local_office("docs/") # replace with real directory
Check :doc:`../../api/read_api` for more details
Inference Result
==================
.. admonition:: Tip
:class: tip
Please first navigate to :doc:`tutorial/pipeline` to get an initial understanding of how the pipeline works; this will help in understanding the content of this section.
The **InferenceResult** class is a container for storing model inference results and implements a series of methods related to these results, such as ``draw_model`` and ``dump_model``.
Check :doc:`../api/model_operators` for more details about **InferenceResult**.
Model Inference Result
-----------------------
Structure Definition
^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
from pydantic import BaseModel, Field
from enum import IntEnum
class CategoryType(IntEnum):
title = 0 # Title
plain_text = 1 # Text
abandon = 2 # Includes headers, footers, page numbers, and page annotations
figure = 3 # Image
figure_caption = 4 # Image description
table = 5 # Table
table_caption = 6 # Table description
table_footnote = 7 # Table footnote
isolate_formula = 8 # Block formula
formula_caption = 9 # Formula label
embedding = 13 # Inline formula
isolated = 14 # Block formula
text = 15 # OCR recognition result
class PageInfo(BaseModel):
page_no: int = Field(description="Page number, the first page is 0", ge=0)
height: int = Field(description="Page height", gt=0)
width: int = Field(description="Page width", ge=0)
class ObjectInferenceResult(BaseModel):
category_id: CategoryType = Field(description="Category", ge=0)
poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
score: float = Field(description="Confidence of the inference result")
latex: str | None = Field(description="LaTeX parsing result", default=None)
html: str | None = Field(description="HTML parsing result", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
page_info: PageInfo = Field(description="Page metadata")
Example
^^^^^^^^^^^
.. code:: json
[
{
"layout_dets": [
{
"category_id": 2,
"poly": [
99.1906967163086,
100.3119125366211,
730.3707885742188,
100.3119125366211,
730.3707885742188,
245.81326293945312,
99.1906967163086,
245.81326293945312
],
"score": 0.9999997615814209
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 5,
"poly": [
99.13092803955078,
2210.680419921875,
497.3183898925781,
2210.680419921875,
497.3183898925781,
2264.78076171875,
99.13092803955078,
2264.78076171875
],
"score": 0.9999997019767761
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
}
]
The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3],
representing the coordinates of the top-left, top-right, bottom-right,
and bottom-left points respectively. |Poly Coordinate Diagram|
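For reference, a small illustrative helper that collapses such a ``poly`` into an axis-aligned bounding box:

.. code:: python

    def poly_to_bbox(poly: list[float]) -> list[float]:
        """Convert [x0, y0, x1, y1, x2, y2, x3, y3] corner coordinates
        into an axis-aligned [xmin, ymin, xmax, ymax] box."""
        xs, ys = poly[0::2], poly[1::2]
        return [min(xs), min(ys), max(xs), max(ys)]

    # e.g. the first detection from the example above
    poly_to_bbox([99.19, 100.31, 730.37, 100.31, 730.37, 245.81, 99.19, 245.81])
    # -> [99.19, 100.31, 730.37, 245.81]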
Inference Result
-------------------------
.. code:: python
from magic_pdf.operators.models import InferenceResult
from magic_pdf.data.dataset import Dataset
dataset : Dataset = some_data_set # not real dataset
# The inference results of all pages, ordered by page number, are stored in a list; this list is MinerU's model inference result
model_inference_result: list[PageInferenceResults] = []  # PageInferenceResults as defined in "Structure Definition" above
inference_result = InferenceResult(model_inference_result, dataset)
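Once wrapped, the result can be visualized or persisted through the methods mentioned above. This is a sketch only: the exact signatures of ``draw_model`` and ``dump_model`` are documented in :doc:`../api/model_operators`; here ``draw_model`` is assumed to take an output path and ``dump_model`` a ``DataWriter`` plus a relative file name.

.. code:: python

    from magic_pdf.data.data_reader_writer import FileBasedDataWriter

    writer = FileBasedDataWriter("output")

    # draw the detected layout boxes over the source pages (output path assumed)
    inference_result.draw_model("output/some_model.pdf")

    # dump the raw inference results through a DataWriter (signature assumed)
    inference_result.dump_model(writer, "model.json")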
some_model.pdf
^^^^^^^^^^^^^^^^^^^^
.. figure:: ../_static/image/inference_result.png
.. |Poly Coordinate Diagram| image:: ../_static/image/poly.png
Installation
==============
.. toctree::
:maxdepth: 1
install/install
install/boost_with_cuda
install/download_model_weight_files
install/config
Boost with CUDA
================
If your device supports CUDA and meets the GPU requirements of the
mainline environment, you can use GPU acceleration. Please select the
appropriate guide based on your system:
- :ref:`ubuntu_22_04_lts_section`
- :ref:`windows_10_or_11_section`
.. _ubuntu_22_04_lts_section:
Ubuntu 22.04 LTS
-----------------
1. Check if NVIDIA Drivers Are Installed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: sh
nvidia-smi
If you see information similar to the following, it means that the
NVIDIA drivers are already installed, and you can skip Step 2.
.. note::
``CUDA Version`` should be >= 12.4. If the displayed version number is less than 12.4, please upgrade the driver.
.. code:: text
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 572.83 CUDA Version: 12.8 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 0% 51C P8 12W / 200W | 1489MiB / 8192MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
2. Install the Driver
~~~~~~~~~~~~~~~~~~~~~
If no driver is installed, use the following command:
.. code:: sh
sudo apt-get update
sudo apt-get install nvidia-driver-570-server
Install the proprietary driver and restart your computer after
installation.
.. code:: sh
reboot
3. Install Anaconda
~~~~~~~~~~~~~~~~~~~
If Anaconda is already installed, skip this step.
.. code:: sh
wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
bash Anaconda3-2024.06-1-Linux-x86_64.sh
In the final step, enter ``yes``, close the terminal, and reopen it.
4. Create an Environment Using Conda
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify a Python version between 3.10 and 3.13.
.. code:: sh
conda create -n mineru 'python=3.12' -y
conda activate mineru
5. Install Applications
~~~~~~~~~~~~~~~~~~~~~~~
.. code:: sh
pip install -U "magic-pdf[full]"
.. admonition:: TIP
:class: tip
After installation, you can check the version of ``magic-pdf`` using the following command:
.. code:: sh
magic-pdf --version
6. Download Models
~~~~~~~~~~~~~~~~~~
Refer to detailed instructions on :doc:`download_model_weight_files`
7. Understand the Location of the Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After completing the `6. Download Models <#6-download-models>`__ step,
the script will automatically generate a ``magic-pdf.json`` file in the
user directory and configure the default model path. You can find the
``magic-pdf.json`` file in your user directory.
.. admonition:: TIP
:class: tip
The user directory for Linux is “/home/username”.
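An abbreviated, illustrative example of what the generated file might contain; the model path below is a placeholder, and the file actually written by the download script is authoritative:

.. code:: json

    {
        "models-dir": "/home/username/models",
        "device-mode": "cpu"
    }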
8. First Run
~~~~~~~~~~~~
Download a sample file from the repository and test it.
.. code:: sh
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
9. Test CUDA Acceleration
~~~~~~~~~~~~~~~~~~~~~~~~~
If your graphics card has at least **8GB** of VRAM, follow these steps
to test CUDA acceleration:
1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
configuration file located in your home directory.
.. code:: json
{
"device-mode": "cuda"
}
2. Test CUDA acceleration with the following command:
.. code:: sh
magic-pdf -p small_ocr.pdf -o ./output
.. _windows_10_or_11_section:
Windows 10/11
--------------
1. Install CUDA
~~~~~~~~~~~~~~~~~~~~~~~~~
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the `official PyTorch website <https://pytorch.org/get-started/locally/>`__.
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
2. Install Anaconda
~~~~~~~~~~~~~~~~~~~
If Anaconda is already installed, you can skip this step.
Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
3. Create an Environment Using Conda
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
::
conda create -n mineru 'python=3.12' -y
conda activate mineru
4. Install Applications
~~~~~~~~~~~~~~~~~~~~~~~
::
pip install -U "magic-pdf[full]"
.. admonition:: Tip
:class: tip
After installation, you can check the version of ``magic-pdf``:
.. code:: bash
magic-pdf --version
5. Download Models
~~~~~~~~~~~~~~~~~~
Refer to detailed instructions on :doc:`download_model_weight_files`
6. Understand the Location of the Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After completing the `5. Download Models <#5-download-models>`__ step,
the script will automatically generate a ``magic-pdf.json`` file in the
user directory and configure the default model path. You can find the
``magic-pdf.json`` file in your user directory.
.. admonition:: Tip
:class: tip
The user directory for Windows is “C:/Users/username”.
7. First Run
~~~~~~~~~~~~
Download a sample file from the repository and test it.
.. code:: powershell
wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
8. Test CUDA Acceleration
~~~~~~~~~~~~~~~~~~~~~~~~~
If your graphics card has at least 8GB of VRAM, follow these steps to
test CUDA-accelerated parsing performance.
1. **Overwrite the installation of torch and torchvision** with builds that support CUDA. (Please select the appropriate index-url based on your CUDA version. For more details, refer to the `PyTorch official website <https://pytorch.org/get-started/locally/>`__.)
.. code:: sh
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``
configuration file located in your user directory.
.. code:: json
{
"device-mode": "cuda"
}
3. **Run the following command to test CUDA acceleration**:
::
magic-pdf -p small_ocr.pdf -o ./output
Download Model Weight Files
==============================
Model downloads are divided into initial downloads and updates to the
model directory. Please refer to the corresponding documentation for
instructions on how to proceed.
Initial download of model files
-----------------------------------
1. Download the Model from Hugging Face
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use a Python script to download the model files from Hugging Face:
.. code:: bash
pip install huggingface_hub
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
The Python script will automatically download the model files and
configure the model directory in the configuration file.
The configuration file can be found in the user directory, with the
filename ``magic-pdf.json``.
How to update models previously downloaded
----------------------------------------------
1. Models downloaded via Hugging Face or Model Scope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you previously downloaded models via Hugging Face or Model Scope, you
can rerun the Python script used for the initial download. This will
automatically update the model directory to the latest version.
Quick Start
==============
Want to learn how to use MinerU in different scenarios? This page provides examples covering multiple use cases that may match your needs.
.. toctree::
:maxdepth: 1
quick_start/convert_pdf
quick_start/convert_image
quick_start/convert_ms_office
Convert Image
===============
Command Line
^^^^^^^^^^^^^
.. code:: bash

    # make sure the file has the correct suffix
    magic-pdf -p a.png -o output -m auto
API
^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_images
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
# proc
## Create Dataset Instance
input_file = "some_image.jpg" # replace with real image file
input_file_name = input_file.split(".")[0]
ds = read_local_images(input_file)[0]
# ocr mode
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir
)
Convert Doc
=============
.. admonition:: Warning
:class: tip
When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
Command Line
^^^^^^^^^^^^^
.. code:: bash

    # replace with a real MS Office file; MS-DOC, MS-DOCX, MS-PPT and MS-PPTX are currently supported
    magic-pdf -p a.doc -o output -m auto
API
^^^^^^^^
.. code:: python
import os
from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.data.read_api import read_local_office
from magic_pdf.config.enums import SupportedPdfParseMethod
# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
# proc
## Create Dataset Instance
input_file = "some_doc.doc" # replace with real ms-office file, we support MS-DOC, MS-DOCX, MS-PPT, MS-PPTX now
input_file_name = input_file.split(".")[0]
ds = read_local_office(input_file)[0]
## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir)
else:
ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
md_writer, f"{input_file_name}.md", image_dir)