# MME Benchmark

[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

Qwen-VL-Chat achieves SOTA results on both the perception and cognition evaluations.

## Perception Evaluation

| Rank |      Model      |          Version         |  Score  |
|:----:|:---------------:|:------------------------:|:-------:|
|   1  | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)**|        **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)**       | **1487.57** |
|   2  |    Skywork-MM   |      Skywork-MM-13B      | 1419.08 |
|   3  |      MMICL      |         FlanT5xxl        | 1376.00 |
|   4  |       Lynx      |         vicuna-7b        | 1373.23 |
|   5  |      BLIVA      |         FlanT5xxl        | 1337.73 |

## Cognition Evaluation

| Rank |       Model      |     Version    |    Score   |
|:----:|:----------------:|:--------------:|:----------:|
|   1  | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** |   **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)**  | **360.71** |
|   2  |       MMICL      |    FlanT5xxl   |   360.36   |
|   3  |    Skywork-MM    | Skywork-MM-13B |   356.43   |
|   4  |       BLIVA      |    FlanT5xxl   |   331.43   |
|   5  |  LRV-Instruction |     LRV-7B     |   328.21   |

## Full Metrics

```
=========== Perception ===========
total score: 1487.576330532213 

         existence  score: 158.33333333333331
         count  score: 150.0
         position  score: 128.33333333333334
         color  score: 170.0
         posters  score: 178.57142857142856
         celebrity  score: 120.58823529411764
         scene  score: 152.25
         landmark  score: 164.0
         artwork  score: 125.5
         OCR  score: 140.0


=========== Cognition ===========
total score: 360.71428571428567 

         commonsense_reasoning  score: 130.7142857142857
         numerical_calculation  score: 40.0
         text_translation  score: 147.5
         code_reasoning  score: 42.5
```
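As a sanity check on the metrics above, each total is simply the sum of its subtask scores. A minimal sketch (scores copied verbatim from the block above):

```python
# Subtask scores from the "Full Metrics" output above.
perception = {
    "existence": 158.33333333333331,
    "count": 150.0,
    "position": 128.33333333333334,
    "color": 170.0,
    "posters": 178.57142857142856,
    "celebrity": 120.58823529411764,
    "scene": 152.25,
    "landmark": 164.0,
    "artwork": 125.5,
    "OCR": 140.0,
}
cognition = {
    "commonsense_reasoning": 130.7142857142857,
    "numerical_calculation": 40.0,
    "text_translation": 147.5,
    "code_reasoning": 42.5,
}

# The totals reported by calculation.py are the plain sums of the subtasks.
perception_total = sum(perception.values())
cognition_total = sum(cognition.values())
print(f"Perception: {perception_total:.2f}")  # ≈ 1487.58
print(f"Cognition:  {cognition_total:.2f}")   # ≈ 360.71
```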

## How to Reproduce the Results on the MME Benchmark

1. Download MME images and eval_tool from the [MME repo](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md)
2. Rearrange images by executing `python get_images.py`
3. Evaluate Qwen-VL-Chat results by executing `python eval.py`
4. Calculate MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`; the calculation script comes from the MME eval_tool
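
The steps above can be sketched as a shell session; the working directory is an assumption about where these scripts live in the Qwen-VL checkout, and the image/eval_tool download remains a manual step per the MME repo instructions:

```shell
# Assumed layout: the MME eval scripts live alongside this README in the
# Qwen-VL repository checkout.
git clone https://github.com/QwenLM/Qwen-VL.git
cd Qwen-VL  # cd into the directory containing get_images.py / eval.py (path is an assumption)

# 1. Manually download MME images and eval_tool following the MME repo README,
#    then place them where the scripts expect (see the MME repo for details).

# 2. Rearrange the downloaded images into the expected layout
python get_images.py

# 3. Run Qwen-VL-Chat over the benchmark to produce result files
python eval.py

# 4. Score the results with the MME eval_tool's calculation script
python calculation.py --results_dir Qwen-VL-Chat
```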