![From https://github.com/necla-ml/SNLI-VE.](imgs/snli_ve.png)

# SNLI-VE: Visual Entailment Dataset

## Description
(from https://arxiv.org/abs/1811.10582)

**The SNLI-VE dataset is built on top of Flickr30k. See the auto-downloading script below.**

### Distribution by Split

The statistics of the train, dev and test splits are shown below. The instances of the three labels (entailment, neutral and contradiction) are roughly evenly distributed within each split; a quick numeric check follows the table.

|                 | Train  | Dev  | Test |
| --------------- | :----: | :--: | :--: |
| #Image          | 29783  | 1000 | 1000 |
| #Entailment     | 176932 | 5959 | 5973 |
| #Neutral        | 176045 | 5960 | 5964 |
| #Contradiction  | 176550 | 5939 | 5964 |
| Vocabulary Size | 29550  | 6576 | 6592 |
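
As a sanity check on the label balance, the counts from the table can be turned into per-split fractions (numbers copied directly from the table above; this snippet is illustrative only):

```python
# Label counts per split, copied from the table above.
counts = {
    "train": {"entailment": 176932, "neutral": 176045, "contradiction": 176550},
    "dev":   {"entailment": 5959,   "neutral": 5960,   "contradiction": 5939},
    "test":  {"entailment": 5973,   "neutral": 5964,   "contradiction": 5964},
}

for split, labels in counts.items():
    total = sum(labels.values())
    fractions = {name: round(n / total, 3) for name, n in labels.items()}
    print(split, total, fractions)  # each label accounts for roughly 1/3 of every split
```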

## Task
(from https://github.com/necla-ml/SNLI-VE)

The problem that Visual Entailment (VE) addresses is reasoning about the relationship between an image premise P_{image} and a text hypothesis H_{text}.

Specifically, given an image as the premise and a natural language sentence as the hypothesis, one of three labels (entailment, neutral or contradiction) is assigned based on the relationship conveyed by the pair (P_{image}, H_{text}):

- **Entailment** holds if there is enough evidence in P_{image} to conclude that H_{text} is true.
- **Contradiction** holds if there is enough evidence in P_{image} to conclude that H_{text} is false.
- Otherwise, the relationship is **neutral**, implying the evidence in P_{image} is insufficient to draw a conclusion about H_{text}.
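
For illustration only, a minimal sketch of this three-way decision: given per-label scores produced by some VE model for a (P_{image}, H_{text}) pair, the predicted label is the arg-max. The function and score values below are hypothetical, not part of the dataset or any referenced model.

```python
# Hypothetical sketch: map per-label scores from a VE model for one
# (P_image, H_text) pair to the predicted SNLI-VE label via arg-max.
LABELS = ("entailment", "neutral", "contradiction")

def predict_label(scores):
    """scores: mapping from label name to model score/probability (hypothetical)."""
    return max(LABELS, key=lambda label: scores[label])

# Strong evidence that the hypothesis is true in the image -> entailment.
print(predict_label({"entailment": 0.81, "neutral": 0.14, "contradiction": 0.05}))
```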


## Metrics
Accuracy, i.e. the fraction of (image, hypothesis) pairs whose predicted label matches the gold label, reported on the dev and test splits.
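
A minimal sketch of the metric (plain Python, no framework assumed):

```python
# Fraction of (image, hypothesis) pairs whose predicted label matches the gold label.
def accuracy(predictions, gold_labels):
    assert len(predictions) == len(gold_labels)
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

print(accuracy(["entailment", "neutral"], ["entailment", "contradiction"]))  # 0.5
```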

## Leaderboard
(Ranked by accuracy on the dev split.)
| Rank | Model  | dev | test | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1    |  CoCa  |   87.0   |   87.1   |  [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 2    | SimVLM  |   86.2   |   86.3   | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 3    | SOHO  |   85.0   |  85.0  | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |
| 4    | ALBEF  |   80.8   |   80.9   |  [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/)                                                 |
| 5    | VILLA  | 80.2  | 80.0  | [paper](https://arxiv.org/abs/2006.06195), [code](https://github.com/zhegan27/VILLA) |
| 6    | UNITER | 79.4  | 79.4 |                                                          [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER)                                                          |
| 7    | LXMERT | 72.4  | 72.5 |                                                          [paper](https://aclanthology.org/D19-1514.pdf), [code](https://github.com/airsplay/lxmert)                                                          |
| 8    |  BUTD   |  65.3  | 65.7 |   [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention)                |

## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_flickr.py
```
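
Once the images are downloaded, the dataset can be loaded through LAVIS's dataset builders. A hedged sketch, assuming the SNLI-VE builder is registered under the name `snli_ve` (check `lavis.datasets.builders` for the exact name in your LAVIS version):

```python
from lavis.datasets.builders import load_dataset

# Assumption: the SNLI-VE builder is registered as "snli_ve" in LAVIS;
# adjust the name if your installed version registers it differently.
snli_ve = load_dataset("snli_ve")

print(snli_ve.keys())         # expected splits, e.g. train / val / test
print(len(snli_ve["train"]))  # number of training instances
```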

## References
Xie, Ning, Farley Lai, Derek Doran, and Asim Kadav. "Visual entailment task for visually-grounded language learning." arXiv preprint arXiv:1811.10582 (2018).