### 1. When using the command `pip install magic-pdf[full]` on newer versions of macOS, the error `zsh: no matches found: magic-pdf[full]` occurs.
### 1. Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2
On macOS, the default shell has switched from Bash to Z shell, which has special handling logic for certain types of string matching. This can lead to the "no matches found" error. You can try disabling the globbing feature in the command line and then run the installation command again.
```bash
setopt no_nomatch
pip install magic-pdf[full]
```
### 2. Encountering the error `pickle.UnpicklingError: invalid load key, 'v'.` during use
This might be due to an incomplete download of the model file. You can try re-downloading the model file and then try again.
### 7. On some Linux servers, the program immediately reports an error `Illegal instruction (core dumped)`
This might be because the server's CPU does not support the AVX/AVX2 instruction set, or the CPU itself supports it but has been disabled by the system administrator. You can try contacting the system administrator to remove the restriction or change to a different server.
### 8. Error when installing MinerU on CentOS 7 or Ubuntu 18: `ERROR: Failed building wheel for simsimd`
The new version of albumentations (1.4.21) introduces a dependency on simsimd. Since the pre-built package of simsimd for Linux requires a glibc version greater than or equal to 2.28, this causes installation issues on some Linux distributions released before 2019. You can resolve this issue by using the following command:
The new version of albumentations (1.4.21) introduces a dependency on simsimd. Since the pre-built package of simsimd for Linux requires a glibc version greater than or equal to 2.28, this causes installation issues on some Linux distributions released before 2019. You can resolve this issue by using the following command:
### 9. Old Graphics Cards Such as M40 Encounter "RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED"
An error occurs during operation (cuda):
```
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
```
Because BF16 precision is not supported on graphics cards before the Turing architecture and some graphics cards are not recognized by torch, it is necessary to manually disable BF16 precision.
Modify the code in lines 287-290 of the "pdf_parse_union_core_v2.py" file (note that the location may vary in different versions):
After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
After executing the `mineru` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf
### some_pdf_layout.pdf
Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors.
Each page's layout consists of one or more bounding boxes. The number in the top-right corner of each box indicates the reading order. Additionally, different content blocks are highlighted with distinct background colors within the layout.pdf.


### some_pdf_spans.pdf
### some_pdf_spans.pdf(Applicable only to the pipeline backend)
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.


### some_pdf_model.json
### some_pdf_model.json(Applicable only to the pipeline backend)
#### Structure Definition
#### Structure Definition
...
@@ -117,13 +116,39 @@ The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], repres
...
@@ -117,13 +116,39 @@ The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], repres
]
]
```
```
### some_pdf_model_output.txt (Applicable only to the VLM backend)
This file contains the output of the VLM model, with each page's output separated by `----`.
Each page's output consists of text blocks starting with `<|box_start|>` and ending with `<|md_end|>`.
The meaning of each field is as follows:
-`<|box_start|>x0 y0 x1 y1<|box_end|>`
x0 y0 x1 y1 represent the coordinates of a quadrilateral, indicating the top-left and bottom-right points. The values are based on a normalized page size of 1000x1000.
-`<|ref_start|>type<|ref_end|>`
`type` indicates the block type. Possible values are:
```json
{
"text":"Text",
"title":"Title",
"image":"Image",
"image_caption":"Image Caption",
"image_footnote":"Image Footnote",
"table":"Table",
"table_caption":"Table Caption",
"table_footnote":"Table Footnote",
"equation":"Interline Equation"
}
```
-`<|md_start|>Markdown content<|md_end|>`
This field contains the Markdown content of the block. If `type` is `text`, the end of the text may contain the `<|txt_contd|>` tag, indicating that this block can be connected with the following `text` block(s).
If `type` is `table`, the content is in `otsl` format and needs to be converted into HTML for rendering in Markdown.
This file is a JSON array where each element is a dict storing all readable content blocks in the document in reading order.
`content_list` can be viewed as a simplified version of `middle.json`. The content block types are mostly consistent with those in `middle.json`, but layout information is not included.
The content has the following types:
| type | desc |
|:---------|:--------------|
| image | Image |
| table | Table |
| text | Text / Title |
| equation | Block formula |
Please note that both `title` and text blocks in `content_list` are uniformly represented using the text type. The `text_level` field is used to distinguish the hierarchy of text blocks:
- A block without the `text_level` field or with `text_level=0` represents body text.
- A block with `text_level=1` represents a level-1 heading.
- A block with `text_level=2` represents a level-2 heading, and so on.
Each content contains the `page_idx` field, indicating the page number (starting from 0) where the content block resides.
#### example
```json
[
{
"type":"text",
"text":"The response of flow duration curves to afforestation ",
"text_level":1,
"page_idx":0
},
{
"type":"text",
"text":"Received 1 October 2003; revised 22 December 2004; accepted 3 January 2005 ",
"page_idx":0
},
{
"type":"text",
"text":"Abstract ",
"text_level":2,
"page_idx":0
},
{
"type":"text",
"text":"The hydrologic effect of replacing pasture or other short crops with trees is reasonably well understood on a mean annual basis. The impact on flow regime, as described by the annual flow duration curve (FDC) is less certain. A method to assess the impact of plantation establishment on FDCs was developed. The starting point for the analyses was the assumption that rainfall and vegetation age are the principal drivers of evapotranspiration. A key objective was to remove the variability in the rainfall signal, leaving changes in streamflow solely attributable to the evapotranspiration of the plantation. A method was developed to (1) fit a model to the observed annual time series of FDC percentiles; i.e. 10th percentile for each year of record with annual rainfall and plantation age as parameters, (2) replace the annual rainfall variation with the long term mean to obtain climate adjusted FDCs, and (3) quantify changes in FDC percentiles as plantations age. Data from 10 catchments from Australia, South Africa and New Zealand were used. The model was able to represent flow variation for the majority of percentiles at eight of the 10 catchments, particularly for the 10–50th percentiles. The adjusted FDCs revealed variable patterns in flow reductions with two types of responses (groups) being identified. Group 1 catchments show a substantial increase in the number of zero flow days, with low flows being more affected than high flows. Group 2 catchments show a more uniform reduction in flows across all percentiles. The differences may be partly explained by storage characteristics. The modelled flow reductions were in accord with published results of paired catchment experiments. An additional analysis was performed to characterise the impact of afforestation on the number of zero flow days $( N _ { \\mathrm { z e r o } } )$ for the catchments in group 1. This model performed particularly well, and when adjusted for climate, indicated a significant increase in $N _ { \\mathrm { z e r o } }$ . The zero flow day method could be used to determine change in the occurrence of any given flow in response to afforestation. The methods used in this study proved satisfactory in removing the rainfall variability, and have added useful insight into the hydrologic impacts of plantation establishment. This approach provides a methodology for understanding catchment response to afforestation, where paired catchment data is not available. ",
"Table 2 Significance of the rainfall and time terms "
],
"table_footnote":[
"indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
"text":"The response of flow duration curves to afforestation ",
"text_level":1,
"page_idx":0
},
{
"type":"text",
"text":"Received 1 October 2003; revised 22 December 2004; accepted 3 January 2005 ",
"page_idx":0
},
{
"type":"text",
"text":"Abstract ",
"text_level":2,
"page_idx":0
},
{
"type":"text",
"text":"The hydrologic effect of replacing pasture or other short crops with trees is reasonably well understood on a mean annual basis. The impact on flow regime, as described by the annual flow duration curve (FDC) is less certain. A method to assess the impact of plantation establishment on FDCs was developed. The starting point for the analyses was the assumption that rainfall and vegetation age are the principal drivers of evapotranspiration. A key objective was to remove the variability in the rainfall signal, leaving changes in streamflow solely attributable to the evapotranspiration of the plantation. A method was developed to (1) fit a model to the observed annual time series of FDC percentiles; i.e. 10th percentile for each year of record with annual rainfall and plantation age as parameters, (2) replace the annual rainfall variation with the long term mean to obtain climate adjusted FDCs, and (3) quantify changes in FDC percentiles as plantations age. Data from 10 catchments from Australia, South Africa and New Zealand were used. The model was able to represent flow variation for the majority of percentiles at eight of the 10 catchments, particularly for the 10–50th percentiles. The adjusted FDCs revealed variable patterns in flow reductions with two types of responses (groups) being identified. Group 1 catchments show a substantial increase in the number of zero flow days, with low flows being more affected than high flows. Group 2 catchments show a more uniform reduction in flows across all percentiles. The differences may be partly explained by storage characteristics. The modelled flow reductions were in accord with published results of paired catchment experiments. An additional analysis was performed to characterise the impact of afforestation on the number of zero flow days $( N _ { \\mathrm { z e r o } } )$ for the catchments in group 1. This model performed particularly well, and when adjusted for climate, indicated a significant increase in $N _ { \\mathrm { z e r o } }$ . The zero flow day method could be used to determine change in the occurrence of any given flow in response to afforestation. The methods used in this study proved satisfactory in removing the rainfall variability, and have added useful insight into the hydrologic impacts of plantation establishment. This approach provides a methodology for understanding catchment response to afforestation, where paired catchment data is not available. ",
"Table 2 Significance of the rainfall and time terms "
],
"table_footnote":[
"indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "