# MMMU Benchmark

### Paper

Title: `MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI`

Abstract: `MMMU is a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.`

`The benchmark is composed of 30 tasks, for a total of 900 mixed image+text examples (some with multiple images in context).`

Homepage: `https://github.com/MMMU-Benchmark/MMMU/tree/main/mmmu`

Note: Some questions have multiple images in context. To control for this, pass `max_images=N` in the model init arguments.
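
For example, a run that caps the number of in-context images at two might look like the following (an illustrative sketch, assuming the `lm_eval` CLI from lm-evaluation-harness; the model and batch size are placeholders):

```
lm_eval --model hf-multimodal \
    --model_args pretrained=HuggingFaceM4/idefics2-8b,max_images=2,convert_img_format=True \
    --tasks mmmu_val \
    --batch_size 2 \
    --apply_chat_template
```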

### Citation

```
@inproceedings{yue2023mmmu,
  title={MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI},
  author={Xiang Yue and Yuansheng Ni and Kai Zhang and Tianyu Zheng and Ruoqi Liu and Ge Zhang and Samuel Stevens and Dongfu Jiang and Weiming Ren and Yuxuan Sun and Cong Wei and Botao Yu and Ruibin Yuan and Renliang Sun and Ming Yin and Boyuan Zheng and Zhenzhu Yang and Yibo Liu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
  booktitle={Proceedings of CVPR},
  year={2024},
}
```

### Groups, Tags, and Tasks

#### Groups

* `mmmu_val`
* `mmmu_val_art_and_design`
* `mmmu_val_business`
* `mmmu_val_health_and_medicine`
* `mmmu_val_humanities_and_social_science`
* `mmmu_val_science`
* `mmmu_val_tech_and_engineering`

#### Tags


#### Tasks

* `mmmu_val_accounting`
* `mmmu_val_agriculture`
* `mmmu_val_architecture_and_engineering`
* `mmmu_val_art`
* `mmmu_val_art_theory`
* `mmmu_val_basic_medical_science`
* `mmmu_val_biology`
* `mmmu_val_chemistry`
* `mmmu_val_clinical_medicine`
* `mmmu_val_computer_science`
* `mmmu_val_design`
* `mmmu_val_diagnostics_and_laboratory_medicine`
* `mmmu_val_economics`
* `mmmu_val_electronics`
* `mmmu_val_energy_and_power`
* `mmmu_val_finance`
* `mmmu_val_geography`
* `mmmu_val_history`
* ...

### Variants

The `mmmu_val` group implements MMMU using the processing code [from the original MMMU authors](https://github.com/MMMU-Benchmark/MMMU/tree/main/mmmu) and the prompt format found in [the MMMU repository for Llava-1.5](https://github.com/MMMU-Benchmark/MMMU/blob/main/mmmu/configs/llava1.5.yaml). This implementation should give scores on par with, or slightly higher than, those reported by [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/tasks/mmmu) for `mmmu_val` and by the MMMU repository code.
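
The results below were produced with commands of roughly this shape (a sketch assuming the `lm_eval` CLI from lm-evaluation-harness; the `--model_args` string and batch size come from the configuration line reported with each result):

```
lm_eval --model hf-multimodal \
    --model_args pretrained=Qwen/Qwen2-VL-2B-Instruct,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True \
    --tasks mmmu_val \
    --batch_size 2 \
    --apply_chat_template
```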

Scores on several tested models (**all with `--apply_chat_template`**) are:

Qwen2-VL-2B:
```
hf-multimodal (pretrained=Qwen/Qwen2-VL-2B-Instruct,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2
```
```
|             Groups             |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------------------------|------:|------|------|------|---|-----:|---|-----:|
|mmmu_val                        |      0|none  |      |acc   |↑  |0.3778|±  |0.0155|
| - Art and Design               |      0|none  |      |acc   |↑  |0.5500|±  |0.0415|
| - Business                     |      0|none  |      |acc   |↑  |0.3600|±  |0.0389|
| - Health and Medicine          |      0|none  |      |acc   |↑  |0.3667|±  |0.0394|
| - Humanities and Social Science|      0|none  |      |acc   |↑  |0.5167|±  |0.0438|
| - Science                      |      0|none  |      |acc   |↑  |0.2467|±  |0.0352|
| - Tech and Engineering         |      0|none  |      |acc   |↑  |0.3143|±  |0.0317|
```
Author-reported score: 41.1%


Qwen2-VL-7B:
```
hf-multimodal (pretrained=Qwen/Qwen2-VL-7B-Instruct,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2
```
```
|             Groups             |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------------------------|------:|------|------|------|---|-----:|---|-----:|
|mmmu_val                        |      0|none  |      |acc   |↑  |0.5056|±  |0.0160|
| - Art and Design               |      0|none  |      |acc   |↑  |0.6917|±  |0.0398|
| - Business                     |      0|none  |      |acc   |↑  |0.4333|±  |0.0406|
| - Health and Medicine          |      0|none  |      |acc   |↑  |0.5667|±  |0.0401|
| - Humanities and Social Science|      0|none  |      |acc   |↑  |0.6750|±  |0.0426|
| - Science                      |      0|none  |      |acc   |↑  |0.3800|±  |0.0392|
| - Tech and Engineering         |      0|none  |      |acc   |↑  |0.4000|±  |0.0341|
```
Author-reported score: 54.1%

Idefics2-8B:
```
hf-multimodal (pretrained=HuggingFaceM4/idefics2-8b,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True,max_images=2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2
```
```
|             Groups             |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------------------------|------:|------|------|------|---|-----:|---|-----:|
|mmmu_val                        |      0|none  |      |acc   |↑  |0.4011|±  |0.0154|
| - Art and Design               |      0|none  |      |acc   |↑  |0.6167|±  |0.0436|
| - Business                     |      0|none  |      |acc   |↑  |0.3200|±  |0.0373|
| - Health and Medicine          |      0|none  |      |acc   |↑  |0.4000|±  |0.0401|
| - Humanities and Social Science|      0|none  |      |acc   |↑  |0.5750|±  |0.0424|
| - Science                      |      0|none  |      |acc   |↑  |0.2600|±  |0.0358|
| - Tech and Engineering         |      0|none  |      |acc   |↑  |0.3381|±  |0.0312|
```
Author-reported score: ~43%

Llava-v1.6-Mistral-7B:
```
hf-multimodal (pretrained=llava-hf/llava-v1.6-mistral-7b-hf,attn_implementation=flash_attention_2,dtype=bfloat16,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2
```
```
|             Groups             |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|--------------------------------|------:|------|------|------|---|-----:|---|-----:|
|mmmu_val                        |      0|none  |      |acc   |↑  |0.3522|±  |0.0151|
| - Art and Design               |      0|none  |      |acc   |↑  |0.5167|±  |0.0440|
| - Business                     |      0|none  |      |acc   |↑  |0.2667|±  |0.0362|
| - Health and Medicine          |      0|none  |      |acc   |↑  |0.3867|±  |0.0397|
| - Humanities and Social Science|      0|none  |      |acc   |↑  |0.5917|±  |0.0433|
| - Science                      |      0|none  |      |acc   |↑  |0.2200|±  |0.0342|
| - Tech and Engineering         |      0|none  |      |acc   |↑  |0.2524|±  |0.0299|
```
Author-reported score: 35.3%


### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?