![Samples from MSRVTT-QA dataset.](imgs/msrvtt_qa.png)(Samples from MSRVTT-QA dataset, image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)
# MSRVTT Dataset (Video Question Answering)
## Description
[MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.
[MSRVTT-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the MSR-VTT dataset, which is larger and has more complex scenes than MSVD. The dataset contains 10K video clips and 243K question-answer pairs.
## Task
Video question answering (VideoQA) is the task where
a video and a natural language question are provided and the model
needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).
## Metrics
Accuracy.
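Since the answers are open-ended single words or short phrases, accuracy is the fraction of questions whose predicted answer string matches the annotated answer. A minimal sketch (the lower-casing/whitespace normalization is an assumption for illustration, not part of the official protocol):
```python
def videoqa_accuracy(predictions, ground_truths):
    """Top-1 accuracy for open-ended VideoQA answers."""
    assert len(predictions) == len(ground_truths)
    correct = sum(
        p.strip().lower() == g.strip().lower()  # simple normalization (assumption)
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


# Toy usage: one of two predictions matches the annotation.
print(videoqa_accuracy(["dog", "running"], ["dog", "walking"]))  # 0.5
```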
## Leaderboard
(Ranked by accuracy on the test set.)
| Rank | Model | Acc. | Resources |
| ---- | :----: | :-------: | :-------: |
| 1 | ALPro | 42.1 | [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
| 2 | VQA-T | 41.5 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
| 3 | CoMVT | 39.5 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
| 4 | ClipBERT | 37.4 | [paper](https://arxiv.org/abs/2102.06183), [code](https://github.com/jayleicn/ClipBERT) |
| 5 | HCRN | 35.6 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
| 6 | HGA | 35.5 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
| 7 | DualVGR | 35.5 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
| 8 | SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
| 9 | HME | 33.0 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
| 10 | AMU | 32.5 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_msrvtt.py
```
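Once downloaded, the annotations can be loaded through LAVIS's dataset builders. A minimal sketch, assuming the builder for this dataset is registered under the name `msrvtt_qa` (check `lavis/datasets/builders` for the exact key in your LAVIS version):
```python
# Sketch of loading MSRVTT-QA via LAVIS; the dataset key "msrvtt_qa"
# is an assumption -- verify the registered name in your installation.
from lavis.datasets.builders import load_dataset

msrvtt_qa = load_dataset("msrvtt_qa")
print(msrvtt_qa.keys())         # e.g. dict_keys(['train', 'val', 'test'])
print(len(msrvtt_qa["train"]))  # number of training question-answer pairs
```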
## References
Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.
Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.
![Samples from MSRVTT dataset.](imgs/msrvtt.png)
# MSRVTT Dataset (Retrieval)
## Description
[MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.
## Task
Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
## Metrics
The common metric is Recall@K (R@K), the [recall](https://en.wikipedia.org/wiki/Precision_and_recall) of the correct item within the top-K retrieved results.
We use TR to denote the video-to-text retrieval recall score and VR to denote the text-to-video retrieval recall score.
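For reference, a minimal sketch of computing Recall@K from a query-gallery similarity matrix, under the simplifying assumption that the ground-truth match of query i is gallery item i:
```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for a similarity matrix sim[i, j] = score(query i, item j),
    assuming the correct match of query i is gallery item i."""
    ranks = np.argsort(-sim, axis=1)                  # best items first
    hits = (ranks[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return hits.mean()

# Toy usage: 3 queries; query 2 retrieves the wrong item at rank 1.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.7, 0.2, 0.3]])
print(recall_at_k(sim, 1))   # 0.666...
print(recall_at_k(sim, 10))  # 1.0
```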
## Leaderboard
(Ranked by TR@1.)
<!-- | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) | -->
## References
Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.
![Samples from MSVD-QA dataset.](imgs/msvd_qa.png)(Samples from MSVD-QA dataset, image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)
# MSVD Dataset (Video Question Answering)
## Description
[MSVD-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the [Microsoft Research Video Description Corpus](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/), which is used in many video captioning experiments. The MSVD-QA dataset contains a total of 1,970 video clips and 50,505 question-answer pairs.
## Task
Video question answering (VideoQA) is the task where
a video and a natural language question are provided and the model
needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).
## Metrics
Accuracy.
## Leaderboard
(Ranked by accuracy on the test set.)
| Rank | Model | Acc. | Resources |
| ---- | :----: | :-------: | :-------: |
| 1 | VQA-T | 46.3 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
| 2 | ALPro | 45.9 | [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
| 3 | CoMVT | 42.6 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
| 4 | DualVGR | 39.0 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
| 5 | HCRN | 36.1 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
| 6 | SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
| 7 | HGA | 34.7 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
| 8 | HME | 33.7 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
| 9 | AMU | 32.0 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
| 10 | ST-VQA | 31.3 | [paper](https://arxiv.org/pdf/1704.04497.pdf), [code](https://github.com/YunseokJANG/tgif-qa) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_msvd.py
```
## References
Chen, David, and William B. Dolan. "Collecting highly parallel data for paraphrase evaluation." In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 190-200. 2011.
Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.
![Samples from the NLVR2 dataset (Image credit: https://arxiv.org/abs/1811.00491).](imgs/NLVR2.png)
# Natural Language for Visual Reasoning for Real (NLVR2)
## Description
(from https://lil.nlp.cornell.edu/nlvr/)
NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR, while including much more visually complex images.
We only publicly release the sentence annotations and original image URLs, along with scripts that download the images from the URLs. If you would like direct access to the images, please fill out the Google Form linked on the NLVR website. This form asks for your basic information and asks you to agree to our Terms of Service.
## Task
(from https://lil.nlp.cornell.edu/nlvr/)
The Natural Language for Visual Reasoning (NLVR) task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. This includes two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.
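Concretely, each NLVR2 example pairs one sentence with two photographs and a true/false label. The sketch below only illustrates that structure; the field names are made up and do not reflect the official JSON schema of the release:
```python
from dataclasses import dataclass

@dataclass
class NLVR2Example:
    # Illustrative fields only; the released JSON uses its own keys.
    sentence: str     # natural-language statement about the image pair
    left_image: str   # path or URL of the left photograph
    right_image: str  # path or URL of the right photograph
    label: bool       # True if the sentence holds for the pair

def accuracy(predictions, examples):
    """Fraction of examples whose predicted boolean matches the label."""
    return sum(p == ex.label for p, ex in zip(predictions, examples)) / len(examples)

ex = NLVR2Example("There are two dogs in total.", "img0.png", "img1.png", True)
print(accuracy([True], [ex]))  # 1.0
```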
## Metrics
Accuracy.
## Leaderboard
(Ranked by accuracy on dev.)
| Rank | Model | dev | test | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1 | VLMo | 88.6 | 89.5 | [paper](https://arxiv.org/pdf/2111.02358.pdf) |
| 2 | CoCa | 86.1 | 87.0 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 3 | SimVLM | 84.5 | 85.2 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 4 | X-VLM | 84.4 | 84.8 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 5 | VinVL | 82.7 | 84.0 | [paper](https://arxiv.org/pdf/2101.00529.pdf), [code](https://github.com/pzzhang/VinVL) |
| 6 | ALBEF | 82.6 | 83.1 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 7 | BLIP | 82.2 | 82.2 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/)|
| 8 | OSCAR | 78.1 | 78.4 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 9 | UNITER | 77.2 | 77.9 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
| 10 | SOHO | 76.4 | 77.3 | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |
## Downloading
Auto-downloading is not supported for this dataset. Please refer to https://lil.nlp.cornell.edu/nlvr/ and fill in the Google form to download the original images.
## References
Suhr, Alane, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. "A corpus for reasoning about natural language grounded in photographs." arXiv preprint arXiv:1811.00491 (2018).
![Samples from the nocaps dataset (Image credit: https://nocaps.org/).](imgs/nocaps.png)
# Nocaps
## Description
(from https://nocaps.org/)
Our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps).
## Task: Novel object captioning
(from https://nocaps.org/)
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task, dubbed nocaps, for novel object captioning at scale.
## Metrics
Models are typically evaluated with the [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) and [SPICE](https://arxiv.org/abs/1607.08822) metrics.
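Both metrics are commonly computed with the COCO caption evaluation toolkit. Below is a minimal sketch using the `pycocoevalcap` package; the package and its `Cider` interface are assumptions about your local setup, and in a real evaluation the captions are first normalized with the PTB tokenizer and scored over the full split:
```python
# Sketch of scoring generated captions with CIDEr, assuming pycocoevalcap
# is installed (pip install pycocoevalcap). Scores are only meaningful
# when computed over a full evaluation set, not a single image.
from pycocoevalcap.cider.cider import Cider

gts = {"img1": ["a dog runs on the beach", "a dog running near the ocean"]}  # references
res = {"img1": ["a dog is running on the beach"]}                            # candidate

corpus_score, per_image_scores = Cider().compute_score(gts, res)
print(corpus_score, per_image_scores)
```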
## Leaderboard
(Ranked by CIDEr)
| Rank | Model | val. CIDEr | val. SPICE | test CIDEr | test SPICE | Resources |
| ---- | :-----: | :----: | :----: | :----: | :----: | :-------: |
| 1 | CoCa | 122.4 | 15.5 | 120.6 | 15.5| [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 2 | LEMON | 117.3 | 15.0 |114.3 | 14.9 | [paper]() |
| 3 | BLIP | 113.2 | 14.8 | - | - | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 4 | SimVLM | 112.2 | - | 110.3 | 14.5 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 5 | VinVL | 105.1 | 14.4 | 103.7 | 14.4 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_nocaps.py
```
## References
Agrawal, Harsh, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. "Nocaps: Novel object captioning at scale." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948-8957. 2019.
![sbu caption](imgs/sbu_caption.png)
(image credit: http://tamaraberg.com/papers/generation_nips2011.pdf)
# SBU Caption Dataset
## Description
(from http://tamaraberg.com/papers/generation_nips2011.pdf)
The SBU Captions dataset was collected by performing Flickr queries and then filtering the noisy results down to 1 million images with associated, visually relevant captions.
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_sbu.py
```
## References
```bibtex
@inproceedings{Ordonez:2011:im2text,
  Author    = {Vicente Ordonez and Girish Kulkarni and Tamara L. Berg},
  Title     = {Im2Text: Describing Images Using 1 Million Captioned Photographs},
  Booktitle = {Neural Information Processing Systems ({NIPS})},
  Year      = {2011},
}
```
![From https://github.com/necla-ml/SNLI-VE.](imgs/snli_ve.png)
# SNLI-VE: Visual Entailment Dataset
## Description
(from https://arxiv.org/abs/1811.10582)
**The SNLI-VE dataset is built on top of Flickr30k. See the downloading script below.**
Distribution by split: the details of the train, dev and test splits are shown below. The instances of the three labels (entailment, neutral and contradiction) are evenly distributed within each split.
| | Train | Dev | Test |
| ---- | :----: | :----: | :----: |
| #Image | 29783 | 1000 | 1000 |
| #Entailment | 176932 | 5959 | 5973 |
| #Neutral | 176045 | 5960 | 5964 |
| #Contradiction | 176550 | 5939 | 5964 |
| Vocabulary Size | 29550 | 6576 | 6592 |
## Task
(from https://github.com/necla-ml/SNLI-VE)
The problem that Visual Entailment (VE) tries to solve is to reason about the relationship between an image premise P_{image} and a text hypothesis H_{text}.
Specifically, given an image as premise and a natural language sentence as hypothesis, one of three labels (entailment, neutral, contradiction) is assigned based on the relationship conveyed by the (P_{image}, H_{text}) pair:
- **entailment** holds if there is enough evidence in P_{image} to conclude that H_{text} is true;
- **contradiction** holds if there is enough evidence in P_{image} to conclude that H_{text} is false;
- otherwise, the relationship is **neutral**, implying the evidence in P_{image} is insufficient to draw a conclusion about H_{text}.
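In practice the task reduces to three-way classification over (P_{image}, H_{text}) pairs. A small sketch of the label handling; the integer ids are an arbitrary illustrative convention, not something prescribed by the dataset:
```python
# Three-way label set of SNLI-VE; the integer ids are an arbitrary choice.
LABELS = {"entailment": 0, "neutral": 1, "contradiction": 2}

def ve_accuracy(pred_ids, gold_labels):
    """Accuracy of predicted class ids against gold label strings."""
    gold_ids = [LABELS[g] for g in gold_labels]
    return sum(p == g for p, g in zip(pred_ids, gold_ids)) / len(gold_ids)

print(ve_accuracy([0, 2, 1], ["entailment", "neutral", "neutral"]))  # 0.666...
```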
## Metrics
Accuracy.
## Leaderboard
(Ranked by accuracy on dev.)
| Rank | Model | dev | test | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1 | CoCa | 87.0 | 87.1 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 2 | SimVLM | 86.2 | 86.3 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 3 | SOHO | 85.0 | 85.0 | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |
| 4 | ALBEF | 80.8 | 80.9 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 5 | VILLA | 80.2 | 80.0 | [paper](https://arxiv.org/abs/2006.06195), [code](https://github.com/zhegan27/VILLA) |
| 6 | UNITER | 79.4 | 79.4 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
| 7 | LXMERT | 72.4 | 72.5 | [paper](https://aclanthology.org/D19-1514.pdf), [code](https://github.com/airsplay/lxmert) |
| 8 | BUTD | 65.3 | 65.7 | [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_flickr.py
```
## References
Xie, Ning, Farley Lai, Derek Doran, and Asim Kadav. "Visual entailment task for visually-grounded language learning." arXiv preprint arXiv:1811.10582 (2018).
![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/vqav2.png)
# Microsoft COCO Dataset (VQAv2)
## Description
(from https://visualqa.org/index.html)
Visual Question Answering (VQA) v2.0 is a dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. It is the second version of the VQA dataset.
- 265,016 images (COCO and abstract scenes)
- At least 3 questions (5.4 questions on average) per image
- 10 ground truth answers per question
- 3 plausible (but likely incorrect) answers per question
- Automatic evaluation metric
## Task
(from https://arxiv.org/pdf/1505.00468.pdf)
The task of free-form and open-ended Visual Question Answering (VQA): given an image and a natural
language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such
as helping the visually impaired, both the questions and answers are open-ended.
## Metrics
Accuracies are computed by the official evaluation server: https://eval.ai/web/challenges/challenge-page/830/leaderboard/2278
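Concretely, the server uses the standard VQA accuracy: a predicted answer scores min(#humans who gave that answer / 3, 1), averaged over all leave-one-out subsets of the 10 human answers. A minimal sketch of that formula (the official implementation also applies answer normalization, omitted here):
```python
def vqa_accuracy(pred, human_answers):
    """Soft VQA accuracy of a single prediction against 10 human answers."""
    scores = []
    for i in range(len(human_answers)):            # leave one annotator out
        others = human_answers[:i] + human_answers[i + 1:]
        matches = sum(a == pred for a in others)
        scores.append(min(matches / 3.0, 1.0))     # 3 agreeing humans = full credit
    return sum(scores) / len(scores)

# Toy usage: 4 of 10 annotators answered "blue".
humans = ["blue"] * 4 + ["green"] * 6
print(vqa_accuracy("blue", humans))  # 1.0
print(vqa_accuracy("red", humans))   # 0.0
```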
## Leaderboard
(Ranked by accuracy on test-dev.)
| Rank | Model | test-dev | test-std | Resources |
| ---- | :----: | :------: | :------: | :-------: |
| 1 | VLMo | 82.8 | 82.8 | [paper](https://arxiv.org/pdf/2111.02358.pdf) |
| 2 | CoCa | 82.3 | 82.3 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 3 | OFA | 82.0 | 82.0 | [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA) |
| 4 | Florence | 80.2 | 80.4 | [paper](https://arxiv.org/abs/2111.11432) |
| 5 | SimVLM | 80.0 | 80.3 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 6 | BLIP | 78.3 | 78.3 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 7 | X-VLM | 78.2 | 78.4 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 8 | VinVL | 76.6 | 76.6 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
| 9 | ALBEF | 75.8 | 76.0 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 10 | UNITER | 73.8 | 74.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_coco.py
```
## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
"Vqa: Visual question answering." Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh.
FROM image.sourcefind.cn:5000/dcu/admin/base/pytorch:2.1.0-ubuntu20.04-dtk24.04.1-py3.8
RUN source /opt/dtk/env.sh
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)