Pipeline
==========


Minimal Example 
^^^^^^^^^^^^^^^^^

.. code:: python

    import os

    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
    from magic_pdf.data.dataset import PymuDocDataset
    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

    # args
    pdf_file_name = "abc.pdf"  # replace with the real pdf path
    # drop the directory and the extension, e.g. "docs/abc.pdf" -> "abc"
    name_without_suff = os.path.splitext(os.path.basename(pdf_file_name))[0]

    # prepare env
    local_image_dir, local_md_dir = "output/images", "output"
    # image directory name as referenced from the generated markdown
    image_dir = str(os.path.basename(local_image_dir))

    os.makedirs(local_image_dir, exist_ok=True)

    image_writer = FileBasedDataWriter(local_image_dir)
    md_writer = FileBasedDataWriter(local_md_dir)

    # read bytes
    reader1 = FileBasedDataReader("")
    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

    # proc
    ## Create Dataset Instance
    ds = PymuDocDataset(pdf_bytes)

    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)

Running the code above produces the following directory layout:


.. code:: bash 

    output/
    ├── abc.md
    └── images


Setting aside the environment setup (importing dependencies and creating directories), the code that actually converts the PDF to Markdown is:


.. code:: python 

    # read bytes
    reader1 = FileBasedDataReader("")
    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

    # proc
    ## Create Dataset Instance
    ds = PymuDocDataset(pdf_bytes)

    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)

``ds.apply(doc_analyze, ocr=True)`` returns an ``InferenceResult`` object. Calling ``pipe_ocr_mode`` on that object produces a ``PipeResult`` object.
Calling ``dump_md`` on the ``PipeResult`` object writes a Markdown file to the specified location.


The following diagram illustrates the pipeline execution process:


.. image:: ../../_static/image/pipeline.drawio.svg 

.. raw:: html

    <br> </br>

Currently, the process is divided into three stages: data, inference, and processing, which correspond to the ``Dataset``, ``InferenceResult``, and ``PipeResult`` entities in the diagram.
The stages are linked through ``apply`` (which invokes functions such as ``doc_analyze``) and through methods such as ``pipe_ocr_mode``.
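
The hand-off between the three stages can be sketched with small stand-in classes. ``ToyDataset``, ``ToyInferenceResult``, ``ToyPipeResult``, and ``toy_analyze`` are hypothetical names invented for this illustration and are not part of ``magic_pdf``; they only mimic how each stage returns the object that the next stage operates on.

.. code:: python

    from typing import Callable, List


    class ToyPipeResult:
        """Stand-in for PipeResult: holds processed, markdown-like blocks."""

        def __init__(self, blocks: List[str]):
            self._blocks = blocks

        def dump_md(self) -> str:
            # Unlike the library method, this toy version returns a string
            # instead of writing a file.
            return "\n\n".join(self._blocks)


    class ToyInferenceResult:
        """Stand-in for InferenceResult: holds raw per-page output."""

        def __init__(self, pages: List[str], ocr: bool):
            self._pages = pages
            self.ocr = ocr

        def pipe_ocr_mode(self) -> ToyPipeResult:
            # Processing stage: turn each page into one block.
            return ToyPipeResult([f"## {page}" for page in self._pages])


    class ToyDataset:
        """Stand-in for Dataset: holds the pages of a document."""

        def __init__(self, pages: List[str]):
            self.pages = list(pages)

        def apply(self, proc: Callable, *args, **kwargs):
            # Same calling convention as documented: proc(self, *args, **kwargs)
            return proc(self, *args, **kwargs)


    def toy_analyze(ds: ToyDataset, ocr: bool = False) -> ToyInferenceResult:
        """Inference stage (hypothetical): wraps the pages without any model."""
        return ToyInferenceResult(ds.pages, ocr)


    ds = ToyDataset(["page 1", "page 2"])

    # Step by step, keeping each intermediate object
    infer_result = ds.apply(toy_analyze, ocr=True)  # data -> inference
    pipe_result = infer_result.pipe_ocr_mode()      # inference -> processing
    stepwise = pipe_result.dump_md()                # processing -> output

    # The same three steps as a single chained expression
    chained = ds.apply(toy_analyze, ocr=True).pipe_ocr_mode().dump_md()

    assert stepwise == chained
    print(chained)

Keeping the intermediate objects in variables, as in the step-by-step form, is convenient when a stage's result needs to be inspected or reused.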


.. admonition:: Tip
    :class: tip

    For more examples of how to use ``Dataset``, ``InferenceResult``, and ``PipeResult``, please refer to :doc:`../quick_start/to_markdown`

    For more detailed information about ``Dataset``, ``InferenceResult``, and ``PipeResult``, please refer to :doc:`../../api/dataset`, :doc:`../../api/model_operators`, :doc:`../../api/pipe_operators`


Pipeline Composition
^^^^^^^^^^^^^^^^^^^^^

.. code:: python 

    import copy
    from abc import ABC, abstractmethod
    from typing import Callable

    # Abridged excerpt: DataWriter and InferenceResultBase come from the
    # library and are not shown here.

    class Dataset(ABC):
        @abstractmethod
        def apply(self, proc: Callable, *args, **kwargs):
            """Apply a callable to this dataset.

            Args:
                proc (Callable): invoke proc as follows:
                    proc(self, *args, **kwargs)

            Returns:
                Any: return the result generated by proc
            """
            pass

    class InferenceResult(InferenceResultBase):

        def apply(self, proc: Callable, *args, **kwargs):
            """Apply a callable to a deep copy of the inference result.

            Args:
                proc (Callable): invoke proc as follows:
                    proc(inference_result, *args, **kwargs)

            Returns:
                Any: return the result generated by proc
            """
            return proc(copy.deepcopy(self._infer_res), *args, **kwargs)

        def pipe_ocr_mode(
            self,
            imageWriter: DataWriter,
            start_page_id=0,
            end_page_id=None,
            debug_mode=False,
            lang=None,
        ) -> PipeResult:
            pass  # body omitted in this excerpt

    class PipeResult:
        def apply(self, proc: Callable, *args, **kwargs):
            """Apply a callable to a deep copy of the pipeline result.

            Args:
                proc (Callable): invoke proc as follows:
                    proc(pipeline_result, *args, **kwargs)

            Returns:
                Any: return the result generated by proc
            """
            return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)


The ``Dataset``, ``InferenceResult``, and ``PipeResult`` classes all have an ``apply`` method, which can be used to chain different stages of the computation. 
As shown below, ``MinerU`` provides a set of methods to compose these classes.


.. code:: python 

    # proc
    ## Create Dataset Instance
    ds = PymuDocDataset(pdf_bytes)

    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)


Users can also write their own functions and chain them with ``apply``. For example, passing the function below to ``Dataset.apply`` counts the number of pages in a PDF file:


.. code:: python

    from magic_pdf.data.data_reader_writer import FileBasedDataReader
    from magic_pdf.data.dataset import PymuDocDataset

    # args
    pdf_file_name = "abc.pdf"  # replace with the real pdf path

    # read bytes
    reader1 = FileBasedDataReader("")
    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

    # proc
    ## Create Dataset Instance
    ds = PymuDocDataset(pdf_bytes)

    def count_page(ds) -> int:
        """Return the number of pages in the dataset."""
        return len(ds)

    print("page count:", ds.apply(count_page))  # prints the page count of `abc.pdf`
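
The class listings in this section show ``InferenceResult.apply`` and ``PipeResult.apply`` passing ``copy.deepcopy`` of their internal data to ``proc``. One consequence is that a user function may modify its argument without changing the stored result. The sketch below illustrates this behaviour under that assumption; ``ToyResult`` and ``tag_pages`` are hypothetical names invented for this example and are not ``magic_pdf`` APIs.

.. code:: python

    import copy
    from typing import Any, Callable, Dict, List


    class ToyResult:
        """Stand-in that mirrors the deep-copying ``apply`` shown above."""

        def __init__(self, res: List[Dict[str, Any]]):
            self._res = res

        def apply(self, proc: Callable, *args, **kwargs):
            # proc receives a deep copy, never the stored data itself
            return proc(copy.deepcopy(self._res), *args, **kwargs)


    def tag_pages(res: List[Dict[str, Any]], label: str) -> List[Dict[str, Any]]:
        """Mutate the received data in place and return it."""
        for page in res:
            page["label"] = label
        return res


    result = ToyResult([{"page": 0}, {"page": 1}])

    # Extra keyword arguments are forwarded to the function
    tagged = result.apply(tag_pages, label="draft")

    # The returned copy carries the labels; the stored data is unchanged
    untouched = result.apply(lambda res: res)
    assert all("label" in page for page in tagged)
    assert all("label" not in page for page in untouched)

The trade-off of this design is one deep copy per ``apply`` call, which may matter for large results.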