Readers may be curious about the difference between :doc:`io` and this section, since the two look very similar at first glance.
:doc:`io` provides the fundamental functions, while this section works at the application level. Users can build their own classes to meet
their own application needs, and those classes can share the same IO functions. That is why :doc:`io` exists.
Important Classes
-----------------
.. code:: python

    class FileBasedDataReader(DataReader):
        def __init__(self, parent_dir: str = ''):
            pass


    class FileBasedDataWriter(DataWriter):
        def __init__(self, parent_dir: str = '') -> None:
            pass

``FileBasedDataReader`` is initialized with a single parameter, ``parent_dir``. Every method that ``FileBasedDataReader`` provides therefore behaves as follows.

Features:

#. When given an absolute path, the content is read from that file and ``parent_dir`` is ignored.
#. When given a relative path, the path is first joined with ``parent_dir``, and the content is then read from the merged path.

.. note::

    ``FileBasedDataWriter`` shares the same behavior as ``FileBasedDataReader``.

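
The path rule described above can be sketched with the standard library alone. The helper ``resolve_local_path`` below is a hypothetical illustration and not part of the library API; it assumes the join is a plain ``os.path.join``.

.. code:: python

    import os


    def resolve_local_path(path: str, parent_dir: str = '') -> str:
        """Hypothetical helper: return the path that would actually be read."""
        # Absolute paths are used as-is, so parent_dir is ignored.
        if os.path.isabs(path):
            return path
        # Relative paths are joined with parent_dir first.
        return os.path.join(parent_dir, path)

With the default empty ``parent_dir``, a relative path is returned unchanged, so it resolves against the current working directory.
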
.. code:: python

    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
        pass

All read-related methods that ``MultiBucketS3DataReader`` provides behave as follows.

Features:

#. When given a full S3-format path, for example ``s3://test_bucket/test_object``, the object is read from that path and ``default_prefix`` is ignored.
#. When given a relative path, the path is joined with ``default_prefix``, the leading ``bucket_name`` is trimmed off, and the content is then read. ``bucket_name`` is the first element obtained by splitting ``default_prefix`` on the delimiter ``/``.

.. note::

    ``MultiBucketS3DataWriter`` shares the same behavior as ``MultiBucketS3DataReader``.

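
The two cases above can be sketched as a small pure-Python function. ``resolve_s3_path`` is a hypothetical helper written for illustration only; it is not the library's implementation, and it assumes ``/`` is the delimiter used in ``default_prefix``.

.. code:: python

    from typing import Tuple


    def resolve_s3_path(path: str, default_prefix: str) -> Tuple[str, str]:
        """Hypothetical helper: return (bucket_name, object_key) for a read."""
        scheme = 's3://'
        if path.startswith(scheme):
            # Full S3 path: default_prefix is ignored.
            bucket_name, _, key = path[len(scheme):].partition('/')
            return bucket_name, key
        # Relative path: the first element of default_prefix is the bucket,
        # and the remainder (if any) is prepended to the relative path.
        parts = default_prefix.strip('/').split('/')
        bucket_name = parts[0]
        key = '/'.join(parts[1:] + [path.lstrip('/')])
        return bucket_name, key

For instance, under these assumptions a relative path ``a/b.pdf`` with ``default_prefix`` set to ``my_bucket/docs`` would resolve to bucket ``my_bucket`` and key ``docs/a/b.pdf``.
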
.. code:: python

    class S3DataReader(MultiBucketS3DataReader):
        pass

``S3DataReader`` is built on top of ``MultiBucketS3DataReader`` and supports only a single bucket. The same applies to ``S3DataWriter``.
Each PDF or image forms one ``Dataset``. PDFs fall into two categories, :ref:`digital_method_section` and :ref:`ocr_method_section`.
Images produce an ``ImageDataset``, which is a subclass of ``Dataset``, and PDF files produce a ``PymuDocDataset``.
The difference between ``ImageDataset`` and ``PymuDocDataset`` is that ``ImageDataset`` supports only the ``OCR`` parse method,
while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.

.. note::

    Some PDFs are generated from images, which means they cannot support the ``TXT`` method. Currently, it is up to the user to ensure that ``TXT`` is not used for such files.

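
Since this check is left to the user, one possible guard is to look at how much text a PDF actually yields before choosing ``TXT``. The sketch below is a hypothetical heuristic, not library behavior: ``choose_parse_method`` and the ``min_chars_per_page`` threshold are illustrative names, and the per-page text is assumed to have been extracted beforehand by whatever tool you already use.

.. code:: python

    from typing import Sequence


    def choose_parse_method(page_texts: Sequence[str], min_chars_per_page: int = 20) -> str:
        """Hypothetical heuristic: return 'txt' or 'ocr' for one PDF."""
        # A document with no pages, or no extractable text, falls back to OCR.
        if not page_texts:
            return 'ocr'
        total_chars = sum(len(text.strip()) for text in page_texts)
        average_chars = total_chars / len(page_texts)
        # Image-based PDFs typically yield little or no embedded text.
        return 'txt' if average_chars >= min_chars_per_page else 'ocr'

The threshold is arbitrary and would need tuning on real documents.
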
Pdf Parse Methods
------------------
.. _ocr_method_section:
OCR
^^^^
Extract characters via ``Optical Character Recognition`` technology.
.. _digital_method_section:
TXT
^^^^^^^^
Extract characters via a third-party library; currently ``pymupdf`` is used.
See :doc:`../../api/classes` for a more intuitive overview, or :doc:`../../api/dataset` for more details.
Read content from a file or directory to create a ``Dataset``. Currently we provide several functions that cover some scenarios.
If you have a new scenario that is common to most users, you can post it on the official GitHub issues with a detailed description.
It is also easy to implement your own read-related functions.
Important Functions
-------------------
read_jsonl
^^^^^^^^^^^^^^^^
Read the content from a JSONL file, which may be located on the local machine or on remote S3. To learn more about JSONL, see :doc:`../../additional_notes/glossary`.
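
For readers unfamiliar with the format, JSONL stores one JSON value per line. The sketch below parses such content with the standard library only. ``parse_jsonl`` is a hypothetical helper for illustration; it is not the library's ``read_jsonl``, whose signature is not shown here.

.. code:: python

    import json
    from typing import Any, Iterator


    def parse_jsonl(text: str) -> Iterator[Any]:
        """Hypothetical helper: yield one decoded JSON value per non-empty line."""
        for line in text.split('\n'):
            line = line.strip()
            # Blank lines are skipped instead of being treated as errors.
            if line:
                yield json.loads(line)

The same function works whether ``text`` was read from a local file or downloaded from S3, since it operates only on the decoded string.
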