# olmOCR-Bench

We developed olmOCR-Bench to automatically and effectively evaluate document-level
parsing and OCR across different tools.

olmOCR-Bench works by testing "facts" or "properties" about document pages at the PDF level.
We work with PDFs directly because they preserve digital metadata and other information that is helpful
and commonly available. Almost any other format can be converted to a PDF, but not the reverse.

## Property classes

- Text presence/absence
  - This task checks that a given small piece of text (e.g. 1-3 sentences) is present, with high probability, within
    a parsed document. It focuses on documents with ambiguity around headers, footers, and similar content. Fuzzy
    matching is still allowed.
- Natural Reading Order
  - This task ensures that blocks of text which are present have a defined order relative to one another. For example,
    on a document that contains multiple news articles on one page, you'd want to see that the first sentence of the
    first article appears after the heading of that article. But you may be okay with swapping the order of the
    articles themselves.
- Table Accuracy
  - Pages with tables are parsed and checked for accuracy on a direct row/column/title basis.
- Formula Accuracy
  - Extract a formula from the document, render it, and compare the rendering using a foundation model.
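
The text presence and reading order checks described above could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual matcher: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for fuzzy matching, and the function names (`best_match`, `text_present`, `appears_before`) and the default threshold of 0.8 are assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_match(needle: str, haystack: str) -> tuple[float, int]:
    """Return (best similarity ratio, start offset) of needle over same-length windows of haystack."""
    # Collapse whitespace so line breaks in the parsed output do not hurt the score.
    needle = " ".join(needle.split())
    haystack = " ".join(haystack.split())
    if not needle:
        return 1.0, 0
    if len(haystack) <= len(needle):
        return SequenceMatcher(None, needle, haystack, autojunk=False).ratio(), 0
    best_ratio, best_start = 0.0, 0
    # Brute-force sliding window: simple, but slow on long documents.
    for start in range(len(haystack) - len(needle) + 1):
        window = haystack[start:start + len(needle)]
        ratio = SequenceMatcher(None, needle, window, autojunk=False).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
        if best_ratio == 1.0:
            break
    return best_ratio, best_start


def text_present(text: str, parsed: str, fuzzy_threshold: float = 0.8) -> bool:
    """Text presence: some window of the parsed document matches well enough."""
    return best_match(text, parsed)[0] >= fuzzy_threshold


def appears_before(first: str, second: str, parsed: str, fuzzy_threshold: float = 0.8) -> bool:
    """Reading order: both snippets are present and `first` starts before `second`."""
    ratio_1, pos_1 = best_match(first, parsed)
    ratio_2, pos_2 = best_match(second, parsed)
    return ratio_1 >= fuzzy_threshold and ratio_2 >= fuzzy_threshold and pos_1 < pos_2
```

A text absence check would then simply be `not text_present(...)` with the same threshold.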

Table Format:
 - pdf_filename
 - Task ID
 - Type: text_presence, text_absense, reading_order, table
 - text_presence, text_absense: {text: str, fuzzy_threshold: float}
 - reading_order: {target_text_presence: task_id, appears_before: task_id, appears_after: task_id}
 - table: {table_index: int}. Matching needs to be fuzzy as well, e.g. does a row exist with column text X, does a column exist with a row containing Y
 - formula: TODO
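
To make the format above concrete, here is a hedged sketch of what a few records and a small validator might look like. The exact key names (`task_id`, `type`), the flat layout of the payload fields, and the one-record-per-line JSON serialization are assumptions for illustration; the type names are copied as they appear in the list above.

```python
import json

# Payload keys each record type is expected to carry, following the list above.
REQUIRED_FIELDS = {
    "text_presence": {"text", "fuzzy_threshold"},
    "text_absense": {"text", "fuzzy_threshold"},
    "reading_order": {"target_text_presence", "appears_before", "appears_after"},
    "table": {"table_index"},
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one test record (empty if none)."""
    problems = [f"missing {key}" for key in ("pdf_filename", "task_id", "type") if key not in record]
    record_type = record.get("type")
    if record_type in REQUIRED_FIELDS:
        missing = REQUIRED_FIELDS[record_type] - set(record)
        problems += [f"missing {key}" for key in sorted(missing)]
    elif record_type is not None:
        problems.append(f"unknown type {record_type!r}")
    return problems


# Hypothetical records for a single page; texts and IDs are made up.
records = [
    {"pdf_filename": "pdf_page1.pdf", "task_id": "t1", "type": "text_presence",
     "text": "Local News", "fuzzy_threshold": 0.8},
    {"pdf_filename": "pdf_page1.pdf", "task_id": "t2", "type": "text_presence",
     "text": "The council met on Monday.", "fuzzy_threshold": 0.8},
    {"pdf_filename": "pdf_page1.pdf", "task_id": "t3", "type": "text_absense",
     "text": "Page 3 of 12", "fuzzy_threshold": 0.9},
    {"pdf_filename": "pdf_page1.pdf", "task_id": "t4", "type": "reading_order",
     "target_text_presence": "t2", "appears_before": "t3", "appears_after": "t1"},
]

# One JSON object per line keeps the file easy to diff and append to.
jsonl = "\n".join(json.dumps(record) for record in records)
```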

## Creation

We sampled documents from the same source as olmocrmix. We run them through two models and pick the ones with the
biggest plain-text diffs that still contain lots of good text and aren't just table- or formula-heavy, for now.
Then we will extract text presence/absence markers, verify them using the tinyhost UI, and write them to JSON.
We may also do some embedding and grouping to get lots of variation, at least when
prioritizing manual review.
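
As a rough sketch of the selection step, the pages could be ranked by how much the two models' plain-text outputs disagree. The function name `rank_by_disagreement`, the use of `difflib.SequenceMatcher`, and the `min_chars` filter (a crude proxy for "contains lots of good text") are all assumptions for illustration, not the procedure actually used.

```python
from difflib import SequenceMatcher


def rank_by_disagreement(outputs_a: dict[str, str], outputs_b: dict[str, str],
                         min_chars: int = 200) -> list[tuple[str, float]]:
    """Rank pages by textual disagreement between two models, largest first.

    outputs_a / outputs_b map a PDF filename to that model's plain-text output.
    Pages where either output is shorter than min_chars are skipped.
    """
    ranked = []
    for name in sorted(outputs_a.keys() & outputs_b.keys()):
        text_a, text_b = outputs_a[name], outputs_b[name]
        if min(len(text_a), len(text_b)) < min_chars:
            continue
        similarity = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()
        ranked.append((name, 1.0 - similarity))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

The top of the ranking would then be the natural place to start manual review.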

Later, we will repeat the same for tables and formulas.

We will also write the evaluator script, which will output a nice templated, tinyhostable results page.

## Running
We do not want to depend on a model producing its output in any specific format.

1. Download the dataset with all PDFs (all will be single-page) to `/pdfs`.
2. Run your extraction on it and point the output to a folder, e.g. `olmocr-v2_1/`, where `pdf_page1.md` is expected for the `/pdfs/pdf_page1.pdf` file.
3. Run the evaluation script.
4. Get the results, and use tinyhost to view all failing examples.
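
Before running the evaluation, it can help to confirm that every PDF has a matching Markdown output. The sketch below assumes the naming convention described above (`pdf_page1.md` for `pdf_page1.pdf`); the helper names `expected_output_path` and `find_missing_outputs` are hypothetical and not part of the repository.

```python
from pathlib import Path


def expected_output_path(pdf_path: Path, output_dir: Path) -> Path:
    """Map /pdfs/pdf_page1.pdf to <output_dir>/pdf_page1.md."""
    return output_dir / (pdf_path.stem + ".md")


def find_missing_outputs(pdf_dir: Path, output_dir: Path) -> list[Path]:
    """Return the PDFs in pdf_dir that have no corresponding .md file in output_dir."""
    return [
        pdf for pdf in sorted(pdf_dir.glob("*.pdf"))
        if not expected_output_path(pdf, output_dir).is_file()
    ]
```

For example, `find_missing_outputs(Path("/pdfs"), Path("olmocr-v2_1"))` would list the pages your extraction run skipped.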

### Running existing scripts

```bash
# Marker
pip install marker-pdf==1.5.4
python olmocr/bench/runners/run_marker.py olmocr/bench/sample_data/pdfs

# GOT-OCR
pip install verovio torchvision
python olmocr/bench/runners/run_gotocr.py olmocr/bench/sample_data/pdfs

# MinerU (in its own conda environment)
conda create -n MinerU python=3.10
conda activate MinerU
# Quote the requirement so shells such as zsh do not treat [full] as a glob pattern
pip install -U "magic-pdf[full]==1.1.0" --extra-index-url https://wheels.myhloli.com
pip install huggingface_hub
wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
python download_models_hf.py
python olmocr/bench/runners/run_mineru.py olmocr/bench/sample_data/pdfs
```