## Installation

MinerU

```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU

conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
```

Third-party software

```bash
# install
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
pip install einops==0.7.0
pip install transformers-stream-generator==0.0.5
pip install accelerate==0.33.0

# uninstall
pip uninstall transformer-engine
```

## Environment Configuration

```bash
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
For instructions on obtaining a `DASHSCOPE_API_KEY`, refer to the [DashScope documentation](https://help.aliyun.com/zh/dashscope/opening-service).
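
Because ingestion and querying both depend on all four variables, it can help to check them before doing any work. The following is a minimal sketch, not code from this repository: `Settings` and `load_settings` are hypothetical names introduced for illustration.

```python
import os
from dataclasses import dataclass

# Names of the variables exported above.
REQUIRED_VARS = ("DASHSCOPE_API_KEY", "ES_USER", "ES_PASSWORD", "ES_URL")


@dataclass(frozen=True)
class Settings:  # hypothetical container, not part of MinerU
    dashscope_api_key: str
    es_user: str
    es_password: str
    es_url: str


def load_settings(env=None) -> Settings:
    """Read the required variables, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        dashscope_api_key=env["DASHSCOPE_API_KEY"],
        es_user=env["ES_USER"],
        es_password=env["ES_PASSWORD"],
        es_url=env["ES_URL"],
    )
```

Calling `load_settings()` with no argument reads from `os.environ`; passing a plain dictionary is convenient in tests.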

## Usage

### Data Ingestion

```bash
python data_ingestion.py -p some.pdf  # load data from a single PDF

# or

python data_ingestion.py -p /opt/data/some_pdf_directory/  # load data from every PDF in the directory
```
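
The `-p` option accepts either a file or a directory. How `data_ingestion.py` expands a directory is not shown in this README, so the sketch below is only one plausible approach. `collect_pdfs` is a hypothetical helper, and the non-recursive scan and case-insensitive `.pdf` match are assumptions.

```python
from pathlib import Path


def collect_pdfs(path: str) -> list[Path]:
    """Return the PDF named by `path`, or the PDFs directly inside a directory."""
    p = Path(path)
    if p.is_file():
        return [p] if p.suffix.lower() == ".pdf" else []
    if p.is_dir():
        # Assumption: only the top level is scanned, in sorted order.
        return sorted(
            f for f in p.iterdir() if f.is_file() and f.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(path)
```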

### Query

```bash
python query.py --question '{the_question_you_want_to_ask}'
```

## Example

````bash
# Start the es service
docker compose up -d

# or, if only the standalone docker-compose command is installed

docker-compose up -d


# Set environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}


# Ingest data
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf


# Ask a question
python query.py -q 'how about the rights of men'

## outputs
Please answer the question based on the content within ```:
            ```
            I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
            ```
            My question is:how about the rights of men。

question: how about the rights of men
answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.

````

## Development

`MinerU` provides a `RAG` integration interface that accepts either a single `pdf` file or a directory of `pdf` files. `MinerU` parses the input automatically and exposes the parsed results through an iterable reader interface.


### API Interface

```python
from pathlib import Path

from magic_pdf.integrations.rag.type import ElementRelation, Node

class RagPageReader:
    def get_rel_map(self) -> list[ElementRelation]:
        # Retrieve the relationships between nodes
        pass
    ...

class RagDocumentReader:
    ...

class DataReader:
    def __init__(self, path_or_directory: str, method: str, output_dir: str):
        pass

    def get_documents_count(self) -> int:
        """Get the number of pdf documents"""
        pass

    def get_document_result(self, idx: int) -> RagDocumentReader | None:
        """Retrieve the parsed content of a specific pdf"""
        pass


    def get_document_filename(self, idx: int) -> Path:
        """Retrieve the path of a specific pdf"""
        pass


```
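
Based only on the method signatures above, a caller can walk every parsed document by index. The sketch below relies on duck typing: `iter_documents` is a hypothetical helper that works with any object exposing `get_documents_count`, `get_document_result`, and `get_document_filename`. `FakeReader` is a stand-in that exists only so the snippet runs without MinerU installed.

```python
from pathlib import Path
from typing import Any, Iterator


def iter_documents(reader: Any) -> Iterator[tuple[Path, Any]]:
    """Yield (filename, parsed document) pairs, skipping documents with no result."""
    for idx in range(reader.get_documents_count()):
        doc = reader.get_document_result(idx)
        if doc is None:  # the signature allows get_document_result to return None
            continue
        yield reader.get_document_filename(idx), doc


class FakeReader:
    """Stand-in with the same three methods as DataReader (illustration only)."""

    def __init__(self, results: dict[str, Any]):
        self._names = list(results)
        self._results = results

    def get_documents_count(self) -> int:
        return len(self._names)

    def get_document_result(self, idx: int):
        return self._results[self._names[idx]]

    def get_document_filename(self, idx: int) -> Path:
        return Path(self._names[idx])


reader = FakeReader({"a.pdf": ["page 1"], "b.pdf": None})
for filename, doc in iter_documents(reader):
    print(filename, doc)
```

With a real `DataReader`, the same loop would receive `RagDocumentReader` objects in place of the placeholder lists used here.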

Type Definitions

```python

class Node(BaseModel):
    category_type: CategoryType = Field(description='Category') # Category
    text: str | None = Field(description='Text content', default=None)
    image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None)
    anno_id: int = Field(description='Unique ID', default=-1)
    latex: str | None = Field(description='LaTeX output for equations or tables', default=None)
    html: str | None = Field(description='HTML output for tables', default=None)



```

Tables can be stored in one of three formats: image, LaTeX, or HTML. 
`anno_id` is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the `get_rel_map` method. Users can use `anno_id` to link nodes and construct a RAG index that includes node relationships.
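
One way to use `anno_id` is to index nodes by ID and then attach relations as adjacency lists. The sketch below uses stand-in dataclasses rather than MinerU's own types. In particular, the `Relation` fields (`source_anno_id`, `target_anno_id`, `relation`) are assumptions about what `get_rel_map` returns and should be checked against `magic_pdf.integrations.rag.type`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleNode:  # stand-in for Node, reduced to the fields used here
    anno_id: int
    category_type: str
    text: Optional[str] = None


@dataclass
class Relation:  # stand-in for ElementRelation; field names are assumptions
    source_anno_id: int
    target_anno_id: int
    relation: str


def link_nodes(nodes, relations):
    """Map each anno_id to a list of (relation, related node) pairs."""
    by_id = {n.anno_id: n for n in nodes}
    linked = defaultdict(list)
    for rel in relations:
        src = by_id.get(rel.source_anno_id)
        dst = by_id.get(rel.target_anno_id)
        if src is None or dst is None:
            continue  # ignore relations that point at unknown nodes
        # Record both directions so either node can find the other.
        linked[src.anno_id].append((rel.relation, dst))
        linked[dst.anno_id].append((rel.relation, src))
    return dict(linked)
```

The resulting dictionary can be stored alongside each node's text so that retrieving a caption also surfaces the image or table it belongs to.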


### Node Relationship Matrix

|                | image_body | table_body |
| -------------- | ---------- | ---------- |
| image_caption  | sibling    |            |
| table_caption  |            | sibling    |
| table_footnote |            | sibling    |
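
When building an index, the matrix can be encoded as a small lookup table. This is an illustrative sketch: `RELATIONS` and `relation_between` are hypothetical names, and the category strings are taken from the row and column labels of the matrix.

```python
# (row category, column category) -> relation, copied from the matrix above.
RELATIONS = {
    ("image_caption", "image_body"): "sibling",
    ("table_caption", "table_body"): "sibling",
    ("table_footnote", "table_body"): "sibling",
}


def relation_between(row_category: str, column_category: str):
    """Return the relation for a (row, column) pair, or None for an empty cell."""
    return RELATIONS.get((row_category, column_category))


print(relation_between("table_footnote", "table_body"))  # sibling
print(relation_between("image_caption", "table_body"))   # None
```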