
read_api 
==========

Read content from a file or directory to create a ``Dataset``. Currently, we provide several functions that cover common scenarios.
If you have a new scenario that is common to most users, you can post it in the official GitHub issues with a detailed description.
It is also easy to implement your own read-related functions.
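As a rough illustration of what a custom read function might look like, the sketch below collects raw file bytes from a single file or from the files directly under a directory, filtered by suffix. The name ``read_local_bytes`` and its parameters are hypothetical and not part of ``magic_pdf``; wrapping the returned bytes into a ``Dataset`` is left out because it depends on the dataset class you target.

.. code:: python

    from pathlib import Path


    def read_local_bytes(path, suffixes):
        """Hypothetical helper (not part of magic_pdf).

        Return the raw bytes of every matching file at ``path``.
        ``path`` may be a single file or a directory; directories are
        scanned one level deep, in sorted order.
        """
        root = Path(path)
        # Normalise "png" and ".png" to the same form: ".png"
        wanted = {"." + s.lower().lstrip(".") for s in suffixes}

        if root.is_file():
            candidates = [root]
        else:
            candidates = sorted(p for p in root.iterdir() if p.is_file())

        # The suffix filter applies to a single file as well
        return [p.read_bytes() for p in candidates if p.suffix.lower() in wanted]

Calling ``read_local_bytes("images/", ["png", "jpg"])`` would return one ``bytes`` object per matching file, which a custom reader could then hand to whichever dataset type fits the content.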


Important Functions
-------------------


read_jsonl
^^^^^^^^^^^^^^^^

Read the content of a JSONL file, which may be located on the local machine or on remote S3. To learn more about JSONL, see :doc:`../../additional_notes/glossary`.
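For orientation, JSONL stores one JSON value per line. The sketch below parses such text with the standard library only; it does not show how ``read_jsonl`` works internally, and the ``id`` field in the sample is made up for illustration.

.. code:: python

    import json


    def iter_jsonl(text):
        """Yield one parsed JSON value per non-empty line of ``text``."""
        for line in text.splitlines():
            line = line.strip()
            if line:
                yield json.loads(line)


    # Two records, one per line (illustrative field name)
    sample = '{"id": 1}\n{"id": 2}\n'
    records = list(iter_jsonl(sample))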

.. code:: python

    from magic_pdf.data.io.read_api import *

    # read jsonl from local machine 
    datasets = read_jsonl("tt.jsonl", None)

    # read jsonl from remote s3 (s3_reader is assumed to be an S3 reader object created beforehand)
    datasets = read_jsonl("s3://bucket_1/tt.jsonl", s3_reader)


read_local_pdfs
^^^^^^^^^^^^^^^^

Read PDFs from a file path or a directory.


.. code:: python

    from magic_pdf.data.io.read_api import *

    # read a single pdf file
    datasets = read_local_pdfs("tt.pdf")

    # read pdfs under directory
    datasets = read_local_pdfs("pdfs/")


read_local_images
^^^^^^^^^^^^^^^^^^^

Read images from a file path or a directory.

.. code:: python 

    from magic_pdf.data.io.read_api import *

    # read a single image file
    datasets = read_local_images("tt.png")

    # read files from a directory whose suffix is in the suffixes list
    datasets = read_local_images("images/", suffixes=["png", "jpg"])


See :doc:`../../api/read_api` for more details.