

Dataset 
===========


Import Classes 
-----------------

Dataset 
^^^^^^^^

Each PDF or image forms one ``Dataset``. A PDF falls into one of two categories, :ref:`digital_method_section` or :ref:`ocr_method_section`, according to the parse method it requires.
Images yield an ``ImageDataset``, which is a subclass of ``Dataset``, and PDF files yield a ``PymuDocDataset``.
The difference between ``ImageDataset`` and ``PymuDocDataset`` is that ``ImageDataset`` supports only the ``OCR`` parse method,
while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
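
The relationship described above can be pictured with a small toy model. This is only a sketch: the class names mirror the ones in the text, but the class bodies, the ``ParseMethod`` enum, and the ``supports`` helper are illustrative assumptions and not the library's actual API.

.. code-block:: python

    from enum import Enum


    class ParseMethod(Enum):
        """Hypothetical enum naming the two parse methods."""
        OCR = "ocr"
        TXT = "txt"


    class Dataset:
        """Hypothetical base class: one instance per input document."""
        supported_methods = frozenset()

        def supports(self, method):
            # True if this kind of dataset can be parsed with `method`.
            return method in self.supported_methods


    class ImageDataset(Dataset):
        # Built from images, so only OCR applies.
        supported_methods = frozenset({ParseMethod.OCR})


    class PymuDocDataset(Dataset):
        # Built from PDF files, which may be parsed with either method.
        supported_methods = frozenset({ParseMethod.OCR, ParseMethod.TXT})

In this model, ``ImageDataset().supports(ParseMethod.TXT)`` is false, while ``PymuDocDataset()`` reports support for both methods.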

.. note::

    Some PDFs are generated from images and therefore cannot support the ``TXT`` method. Currently, it is the user's responsibility to ensure that such files are not parsed with ``TXT``.



Pdf Parse Methods
------------------

.. _ocr_method_section:

OCR
^^^^

Extract characters using ``Optical Character Recognition`` (OCR).

.. _digital_method_section:

TXT
^^^^^^^^

Extract characters directly from the PDF with a third-party library; currently ``pymupdf`` is used.
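
Because a PDF generated from images has no text to extract, it can help to check for extractable text before choosing ``TXT``. The sketch below is one possible heuristic, not the library's own logic: ``has_text_layer`` and ``choose_parse_method`` are hypothetical helpers, and they accept any iterable of page objects that provide a ``get_text()`` method returning a string.

.. code-block:: python

    def has_text_layer(pages, min_chars=1):
        """Return True once the pages yield at least `min_chars`
        non-whitespace characters of extractable text."""
        total = 0
        for page in pages:
            # Drop all whitespace so blank pages do not count as text.
            total += len("".join(page.get_text().split()))
            if total >= min_chars:
                return True
        return False


    def choose_parse_method(pages):
        """Pick "TXT" when text can be extracted, otherwise fall back to "OCR"."""
        return "TXT" if has_text_layer(pages) else "OCR"

With ``pymupdf`` installed, the document returned by ``fitz.open("example.pdf")`` (a placeholder path) can be passed in directly, since it iterates over its pages and each page offers ``get_text()``. A threshold higher than one character may be a safer choice for scanned files that carry a few stray characters.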



Check :doc:`../../api/dataset` for more details