
read_api 
==========

Read content from a file or directory to create a ``Dataset``. Several functions are currently provided to cover common scenarios.
If you have a new scenario that would be useful to most users, you can describe it in detail in an issue on the official GitHub repository.
It is also easy to implement your own read-related functions.
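
As a rough illustration of what a custom read helper could look like, the sketch below collects the raw bytes of every file under a path whose suffix matches. The name ``read_local_files`` and its parameters are hypothetical and not part of ``magic_pdf``. Turning the bytes into a ``Dataset`` is left out here because it depends on the dataset class you target.

```python
from pathlib import Path


def read_local_files(path, suffixes):
    """Hypothetical helper: return {file path: raw bytes} for matching files.

    `path` may be a single file or a directory (searched recursively).
    Suffix matching is case-insensitive.
    """
    root = Path(path)
    wanted = {s.lower() for s in suffixes}
    # A single file is its own only candidate; a directory is walked recursively.
    candidates = [root] if root.is_file() else sorted(root.rglob("*"))
    return {
        str(p): p.read_bytes()
        for p in candidates
        if p.is_file() and p.suffix.lower() in wanted
    }
```

A call such as ``read_local_files("pdfs/", [".pdf"])`` would then give you the bytes to wrap in whichever dataset type fits your scenario.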


Important Functions
-------------------


read_jsonl
^^^^^^^^^^^^^^^^

Read content from a JSONL file located on the local machine or on remote S3. To learn more about JSONL, see :doc:`../../additional_notes/glossary`.
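
For orientation, JSONL (JSON Lines) stores one JSON object per line. The sketch below parses such text with the standard library only. The helper ``parse_jsonl`` and the field name ``path`` are made up for illustration and say nothing about the schema that ``read_jsonl`` expects.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of Python objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two records, one per line; the "path" key is purely illustrative.
sample = '{"path": "a.pdf"}\n{"path": "b.pdf"}\n'
records = parse_jsonl(sample)
print(len(records))  # 2
```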

.. code:: python

    from magic_pdf.data.read_api import *
    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
    from magic_pdf.data.schemas import S3Config

    # read jsonl from local machine
    datasets = read_jsonl("tt.jsonl", None)   # replace with real jsonl file

    # read jsonl from remote s3

    bucket = "bucket_1"                     # replace with real s3 bucket
    ak = "access_key_1"                     # replace with real s3 access key
    sk = "secret_key_1"                     # replace with real s3 secret key
    endpoint_url = "endpoint_url_1"         # replace with real s3 endpoint url

    bucket_2 = "bucket_2"                   # replace with real s3 bucket
    ak_2 = "access_key_2"                   # replace with real s3 access key
    sk_2 = "secret_key_2"                   # replace with real s3 secret key
    endpoint_url_2 = "endpoint_url_2"       # replace with real s3 endpoint url

    s3configs = [
        S3Config(
            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
        ),
        S3Config(
            bucket_name=bucket_2,
            access_key=ak_2,
            secret_key=sk_2,
            endpoint_url=endpoint_url_2,
        ),
    ]

    s3_reader = MultiBucketS3DataReader(bucket, s3configs)

    datasets = read_jsonl(f"s3://{bucket}/tt.jsonl", s3_reader)  # replace with real s3 jsonl file

read_local_pdfs
^^^^^^^^^^^^^^^^^

Read PDFs from a file path or a directory.


.. code:: python

    from magic_pdf.data.read_api import *

    # read pdf path
    datasets = read_local_pdfs("tt.pdf")

    # read pdfs under directory
    datasets = read_local_pdfs("pdfs/")


read_local_images
^^^^^^^^^^^^^^^^^^^

Read images from a file path or a directory.

.. code:: python 

    from magic_pdf.data.read_api import *

    # read from image path 
    datasets = read_local_images("tt.png")  # replace with real file path

    # read files from a directory whose names end with one of the given suffixes
    datasets = read_local_images("images/", suffixes=[".png", ".jpg"])  # replace with real directory 


read_local_office
^^^^^^^^^^^^^^^^^^^^
Read MS Office files from a file path or a directory.

.. code:: python 

    from magic_pdf.data.read_api import *

    # read from an MS Office file path
    datasets = read_local_office("tt.doc")  # replace with real file path

    # read MS Office files under a directory
    datasets = read_local_office("docs/")  # replace with real directory




See :doc:`../../api/read_api` for more details.