Readers may wonder how this section differs from :doc:`io`, since the two look very similar at first glance.
:doc:`io` provides the fundamental functions, while this section works at the application level. Users can build their own classes to meet
the needs of their own applications, and those classes may share the same underlying IO functions. That is why :doc:`io` exists as a separate layer.
Important Classes
-----------------
.. code:: python

    class FileBasedDataReader(DataReader):
        def __init__(self, parent_dir: str = ''):
            pass


    class FileBasedDataWriter(DataWriter):
        def __init__(self, parent_dir: str = '') -> None:
            pass
Class ``FileBasedDataReader`` is initialized with a single parameter, ``parent_dir``. This means that every method ``FileBasedDataReader`` provides has the following features:

#. reading from an absolute path ignores ``parent_dir``;
#. reading from a relative path first joins the path with ``parent_dir``, then reads the content from the merged path.
.. note::

    ``FileBasedDataWriter`` shares the same behavior as ``FileBasedDataReader``.
.. code:: python

    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
        pass
Every read-related method that ``MultiBucketS3DataReader`` provides has the following features:

#. reading an object via a full s3-format path, for example ``s3://test_bucket/test_object``, ignores ``default_prefix``;
#. reading an object via a relative path first joins it with ``default_prefix`` minus its leading ``bucket_name``, then reads the content. ``bucket_name`` is the first element of the result of splitting ``default_prefix`` with the delimiter ``/``.
.. note::

    ``MultiBucketS3DataWriter`` shares the same behavior as ``MultiBucketS3DataReader``.
.. code:: python

    class S3DataReader(MultiBucketS3DataReader):
        pass
``S3DataReader`` is built on top of ``MultiBucketS3DataReader`` and supports only a single bucket; the same applies to ``S3DataWriter``.
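A minimal sketch of the single-bucket convenience class. The constructor shown here, taking the in-bucket prefix, bucket name, and credentials directly, is an assumption; check ``magic_pdf.data.data_reader_writer`` for the exact signature in your version:

.. code:: python

    from magic_pdf.data.data_reader_writer import S3DataReader

    # hypothetical placeholder values; replace with real credentials
    s3_reader = S3DataReader(
        '',              # prefix inside the bucket ('' means bucket root)
        'test_bucket',   # bucket name
        'ak',            # access key
        'sk',            # secret key
        'endpoint_url',  # endpoint url
    )

    # reads s3://test_bucket/test_object
    s3_reader.read('test_object')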
Read Examples
-------------
.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
    from magic_pdf.data.schemas import S3Config

    # file-based related
    file_based_reader1 = FileBasedDataReader('')

    ## will read file abc
    file_based_reader1.read('abc')

    file_based_reader2 = FileBasedDataReader('/tmp')

    ## will read /tmp/abc
    file_based_reader2.read('abc')

    ## will read /tmp/logs/message.txt
    file_based_reader2.read('/tmp/logs/message.txt')

    # multi-bucket s3 related
    bucket = "bucket"                   # replace with real bucket
    ak = "ak"                           # replace with real access key
    sk = "sk"                           # replace with real secret key
    endpoint_url = "endpoint_url"       # replace with real endpoint_url

    bucket_2 = "bucket_2"               # replace with real bucket
    ak_2 = "ak_2"                       # replace with real access key
    sk_2 = "sk_2"                       # replace with real secret key
    endpoint_url_2 = "endpoint_url_2"   # replace with real endpoint_url
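The snippet above stops after defining the credentials. A sketch of how the multi-bucket reader is then constructed and used might look as follows; the ``default_prefix`` value and object keys are hypothetical, and the keyword names follow the mixin described above, so treat them as assumptions if your version differs:

.. code:: python

    s3_configs = [
        S3Config(bucket_name=bucket, access_key=ak, secret_key=sk,
                 endpoint_url=endpoint_url),
        S3Config(bucket_name=bucket_2, access_key=ak_2, secret_key=sk_2,
                 endpoint_url=endpoint_url_2),
    ]

    # default_prefix starts with the bucket name, as described in the features list
    multi_bucket_s3_reader = MultiBucketS3DataReader(
        default_prefix=f'{bucket}/unittest/tmp', s3_configs=s3_configs
    )

    ## will read s3://bucket/unittest/tmp/abc
    multi_bucket_s3_reader.read('abc')

    ## a full s3-format path ignores default_prefix and may target the other bucket
    multi_bucket_s3_reader.read(f's3://{bucket_2}/efg')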
Each PDF or image forms one ``Dataset``. A PDF can be parsed by one of two methods, :ref:`digital_method_section` or :ref:`ocr_method_section`.
Images yield an ``ImageDataset``, a subclass of ``Dataset``, while PDF files yield a ``PymuDocDataset``.
The difference between ``ImageDataset`` and ``PymuDocDataset`` is that ``ImageDataset`` supports only the ``OCR`` parse method,
while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
.. note::

    Some PDFs are generated from images, which means they cannot support the ``TXT`` method. Currently the user must ensure such files are not parsed with ``TXT``.
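A minimal sketch of building each dataset type from local files; the file names are hypothetical, and the byte-oriented constructors shown here follow how the demos typically use these classes:

.. code:: python

    from magic_pdf.data.data_reader_writer import FileBasedDataReader
    from magic_pdf.data.dataset import ImageDataset, PymuDocDataset

    reader = FileBasedDataReader('')

    # a PDF yields a PymuDocDataset, which supports both TXT and OCR
    pdf_bytes = reader.read('example.pdf')   # hypothetical file
    ds = PymuDocDataset(pdf_bytes)

    # an image yields an ImageDataset, which supports OCR only
    img_bytes = reader.read('page.png')      # hypothetical file
    img_ds = ImageDataset(img_bytes)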
PDF Parse Methods
------------------
.. _ocr_method_section:
OCR
^^^^
Extracts characters via Optical Character Recognition (OCR) techniques.
.. _digital_method_section:
TXT
^^^^^^^^
Extracts characters via a third-party library; currently we use ``pymupdf``.
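To choose between the two methods, the model-analysis step is typically invoked through ``Dataset.apply`` with an ``ocr`` flag. A sketch, assuming ``ds`` is the ``PymuDocDataset`` from the earlier sketch:

.. code:: python

    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

    # TXT method: extract the embedded text layer via pymupdf
    infer_result_txt = ds.apply(doc_analyze, ocr=False)

    # OCR method: recognize characters from rendered page images
    infer_result_ocr = ds.apply(doc_analyze, ocr=True)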
This section covers reading content from a file or directory to create a ``Dataset``. Currently we provide several functions that cover some common scenarios.
If you have a new scenario that is common to most users, please post it on the official GitHub issues with a detailed description.
It is also easy to implement your own read-related functions.
Important Functions
-------------------
read_jsonl
^^^^^^^^^^^^^^^^
Reads content from a JSONL file, which may be located on a local machine or on remote S3. If you want to know more about JSONL, please go to :doc:`../../additional_notes/glossary`.
.. code:: python

    from magic_pdf.data.read_api import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
    from magic_pdf.data.schemas import S3Config

    # read jsonl from the local machine
    datasets = read_jsonl("tt.jsonl", None)  # replace with a real jsonl file

    # read jsonl from remote s3
    bucket = "bucket_1"                 # replace with real s3 bucket
    ak = "access_key_1"                 # replace with real s3 access key
    sk = "secret_key_1"                 # replace with real s3 secret key
    endpoint_url = "endpoint_url_1"     # replace with real s3 endpoint url

    bucket_2 = "bucket_2"               # replace with real s3 bucket
    ak_2 = "access_key_2"               # replace with real s3 access key
    sk_2 = "secret_key_2"               # replace with real s3 secret key
    endpoint_url_2 = "endpoint_url_2"   # replace with real s3 endpoint url
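As with the earlier reader example, the snippet ends before the remote read itself. A sketch of the remaining steps; the object key ``tt.jsonl`` is hypothetical, and the constructor keywords are assumptions carried over from the data reader section:

.. code:: python

    s3_reader = MultiBucketS3DataReader(
        default_prefix=bucket,
        s3_configs=[
            S3Config(bucket_name=bucket, access_key=ak, secret_key=sk,
                     endpoint_url=endpoint_url),
            S3Config(bucket_name=bucket_2, access_key=ak_2, secret_key=sk_2,
                     endpoint_url=endpoint_url_2),
        ],
    )

    # pass the reader so read_jsonl can resolve s3-format paths
    datasets = read_jsonl(f"s3://{bucket}/tt.jsonl", s3_reader)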
Please first navigate to :doc:`tutorial/pipeline` to get an initial understanding of how the pipeline works; this will help in understanding the content of this section.
The **InferenceResult** class is a container for storing model inference results and implements a series of methods related to these results, such as ``draw_model`` and ``dump_model``.
Check out :doc:`../api/model_operators` for more details about **InferenceResult**.
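A sketch of how these methods are typically used, continuing from the earlier sketches; the output directory and file names are hypothetical:

.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

    infer_result = ds.apply(doc_analyze, ocr=True)

    os.makedirs('output', exist_ok=True)

    # draw the detected layout boxes onto a copy of the pdf
    infer_result.draw_model(os.path.join('output', 'model.pdf'))

    # dump the raw inference result as json
    writer = FileBasedDataWriter('output')
    infer_result.dump_model(writer, 'model.json')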
Model Inference Result
-----------------------
Structure Definition
^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python

    from pydantic import BaseModel, Field
    from enum import IntEnum


    class CategoryType(IntEnum):
        title = 0            # Title
        plain_text = 1       # Text
        abandon = 2          # Includes headers, footers, page numbers, and page annotations
        figure = 3           # Image
        figure_caption = 4   # Image description
        table = 5            # Table
        table_caption = 6    # Table description
        table_footnote = 7   # Table footnote
        isolate_formula = 8  # Block formula
        formula_caption = 9  # Formula label
        embedding = 13       # Inline formula
        isolated = 14        # Block formula
        text = 15            # OCR recognition result


    class PageInfo(BaseModel):
        page_no: int = Field(description="Page number, the first page is 0", ge=0)
        height: int = Field(description="Page height", gt=0)
        width: int = Field(description="Page width", ge=0)


    class ObjectInferenceResult(BaseModel):  # per-object result; class name assumed, the fields are from the original listing
        poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
        score: float = Field(description="Confidence of the inference result")
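A small usage sketch showing that pydantic enforces the field constraints at construction time; the dimension values are arbitrary:

.. code:: python

    from pydantic import ValidationError

    page = PageInfo(page_no=0, height=1684, width=1190)
    print(page)

    # constraints are checked on construction: height must be > 0
    try:
        PageInfo(page_no=0, height=0, width=1190)
    except ValidationError as err:
        print(err)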
If your graphics card has at least **8GB** of VRAM, follow these steps
to test CUDA acceleration:
1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
configuration file located in your home directory.
.. code:: json

    {
        "device-mode": "cuda"
    }
2. Test CUDA acceleration with the following command:
.. code:: sh

    magic-pdf -p small_ocr.pdf -o ./output
.. _windows_10_or_11_section:
Windows 10/11
--------------
1. Install CUDA
~~~~~~~~~~~~~~~~~~~~~~~~~
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the `official PyTorch website <https://pytorch.org/get-started/locally/>`_.
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
2. Install Anaconda
~~~~~~~~~~~~~~~~~~~
If Anaconda is already installed, you can skip this step.
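A typical environment setup might look as follows; the environment name ``MinerU`` and Python 3.10 are assumptions, so pick whatever matches your setup:

.. code:: sh

    conda create -n MinerU python=3.10
    conda activate MinerU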
If your graphics card has at least 8GB of VRAM, follow these steps to
test CUDA-accelerated parsing performance.

1. **Overwrite the installation of torch and torchvision** with CUDA-enabled builds. (Please select the appropriate index-url based on your CUDA version; for more details, refer to the `PyTorch official website <https://pytorch.org/get-started/locally/>`_.)
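The command typically takes the following shape; ``cu118`` is an assumed index for CUDA 11.8, so substitute the index-url matching your installed CUDA version:

.. code:: sh

    pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu118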
Since table recognition is a time-consuming process, we set a timeout period. If the process exceeds this time, the table recognition will be terminated.
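As a sketch, the timeout typically lives in the table section of ``magic-pdf.json``. The field names below (``table-config``, ``max_time``) are assumptions based on common configurations; check your own configuration file for the exact keys:

.. code:: json

    {
        "table-config": {
            "max_time": 400
        }
    }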
This section is required for handling the **doc**, **docx**, **ppt**, and **pptx** file types. You can **skip** it if you do not need to process those file types.
Linux/macOS Platform
""""""""""""""""""""""

.. code::

    apt-get/yum/brew install libreoffice
Windows Platform
""""""""""""""""""""
.. code::

    install libreoffice
    append "install_dir\LibreOffice\program" to the PATH environment variable
.. tip::

    MinerU is now installed. Check out :doc:`../usage/command_line` to convert your first PDF, **or** read the following sections for more details about installation.