"""
# Copyright (c) 2022, salesforce.com, inc.
# All rights reserved.
# SPDX-License-Identifier: BSD-3-Clause
# For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
"""

import streamlit as st
from app import load_demo_image, device
from app.utils import load_model_cache
from lavis.processors import load_processor
from PIL import Image


def app():
    model_type = st.sidebar.selectbox("Model:", ["BLIP"])

    # ===== layout =====
    st.markdown(
        "<h1 style='text-align: center;'>Visual Question Answering</h1>",
        unsafe_allow_html=True,
    )

    instructions = """Try the provided image or upload your own:"""
    file = st.file_uploader(instructions)

    col1, col2 = st.columns(2)

    col1.header("Image")
    if file:
        raw_img = Image.open(file).convert("RGB")
    else:
        raw_img = load_demo_image()

    w, h = raw_img.size
    scaling_factor = 720 / w
    resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))

    col1.image(resized_image, use_column_width=True)

    col2.header("Question")
    user_question = col2.text_input("Input your question!", "What are objects there?")
    qa_button = st.button("Submit")

    col2.header("Answer")

    # ===== event =====
    vis_processor = load_processor("blip_image_eval").build(image_size=480)
    text_processor = load_processor("blip_question").build()

    if qa_button:
        if model_type.startswith("BLIP"):
            model = load_model_cache(
                "blip_vqa", model_type="vqav2", is_eval=True, device=device
            )

            img = vis_processor(raw_img).unsqueeze(0).to(device)
            question = text_processor(user_question)

            vqa_samples = {"image": img, "text_input": [question]}
            answers = model.predict_answers(vqa_samples, inference_method="generate")

            col2.write("\n".join(answers), use_column_width=True)
![Samples from the AVSD dataset (Image credit: "https://arxiv.org/pdf/1901.09107.pdf").](imgs/avsd_dialogue.png)
# Audio-Visual Scene-Aware Dialogues (AVSD)
## Description
[Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each grounded on a unique video. In the test split, six reference dialogue responses are provided for each test sample.
## Task
(from https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)
In a **video-grounded dialogue task**, the system must generate responses to a user input in the context of a given dialog.
This context consists of a dialog history (previous utterances by both user and system) in addition to the video and audio information that make up the scene. The quality of a system's automatically generated responses is evaluated using objective measures to determine whether they are natural and informative.
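To make the inputs concrete, a single sample can be thought of as a grounding video plus the dialog history and the response to be generated. The structure below is purely illustrative; the field names are hypothetical and do not follow the official AVSD annotation schema.
```
# Purely illustrative structure of one video-grounded dialogue turn (hypothetical field names).
avsd_sample = {
    "video_id": "XYZ123",  # identifies the grounding video (and its audio track)
    "caption": "a man walks into the kitchen and opens the fridge",
    "dialog_history": [
        {"question": "how many people are in the video?", "answer": "just one man"},
        {"question": "what is he doing?", "answer": "he walks over to the fridge"},
    ],
    "question": "does he take anything out of the fridge?",
    "answer": "yes, he takes out a bottle of water",  # target response to generate
}
```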
## Metrics
Models are typically evaluated according to the [BLEU](https://aclanthology.org/P02-1040/), [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics.
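A minimal sketch of computing such caption-style metrics, assuming the `pycocoevalcap` package is installed and using toy, pre-tokenized sentences (METEOR and ROUGE-L are computed analogously with their respective scorers):
```
# Toy example with pycocoevalcap; both scorers expect dicts mapping an id to a list of
# sentences: references may hold several sentences per id, hypotheses exactly one.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

refs = {"0": ["a man is watching television in the living room"]}
hyps = {"0": ["a man watches tv in the living room"]}

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)  # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(refs, hyps)

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
```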
## Leaderboard
TBD
## Auto-Downloading
Please refer to the [benchmark website](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instructions on downloading the dataset.
## References
"Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/coco_caption.png)
# Microsoft COCO Dataset (Captioning)
## Description
[Microsoft COCO Captions dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
## Task
(from https://paperswithcode.com/task/image-captioning)
**Image captioning** is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
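For a concrete starting point, the sketch below generates a caption with LAVIS' `load_model_and_preprocess` helper; `demo.jpg` is a placeholder image path and the printed caption is only an example of the output format:
```
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("demo.jpg").convert("RGB")  # placeholder path

# BLIP captioner fine-tuned on COCO; pretrained weights are fetched on first use.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))  # e.g. ["a dog sitting on a bench"]
```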
## Metrics
Models are typically evaluated according to the [BLEU](https://aclanthology.org/P02-1040/) or [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) metric.
## Leaderboard
(Ranked by BLEU-4)
| Rank | Model | BLEU-4 | CIDEr | METEOR | SPICE | Resources |
| ---- | :-----: | :----: | :---: | :----: | :---: | :----------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | OFA | 44.9 | 154.9 | 32.5 | 26.6 | [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA) |
| 2 | LEMON | 42.6 | 145.5 | 31.4 | 25.5 | [paper]() |
| 3 | CoCa | 40.9 | 143.6 | 33.9 | 24.7 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
| 4 | SimVLM | 40.6 | 143.3 | 33.7 | 25.4 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
| 5 | VinVL | 41.0 | 140.9 | 31.1 | 25.2 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
| 6 | OSCAR | 40.7 | 140.0 | 30.6 | 24.5 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 7 | BLIP | 40.4 | 136.7 | 31.4 | 24.3 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
| 8 | M^2 | 39.1 | 131.2 | 29.2 | 22.6 | [paper](https://arxiv.org/pdf/1912.08226v2.pdf), [code](https://github.com/aimagelab/meshed-memory-transformer) |
| 9 | BUTD | 36.5 | 113.5 | 27.0 | 20.3 | [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention) |
| 10 | ClipCap | 32.2 | 108.4 | 27.1 | 20.1 | [paper](https://arxiv.org/pdf/2111.09734v1.pdf), [code](https://github.com/rmokady/clip_prefix_caption) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_coco.py
```
## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
![Samples from the COCO Caption dataset (Image credit: "https://arxiv.org/pdf/1504.00325.pdf").](imgs/coco_caption.png)
# Microsoft COCO Dataset (Retrieval)
## Description
[Microsoft COCO dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
## Task
Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
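One way to obtain the query-gallery similarity scores that the ranking relies on is to compare projected image and text embeddings from a pretrained model. The sketch below uses LAVIS' BLIP feature extractor; the attribute names (`image_embeds_proj`, `text_embeds_proj`) reflect our reading of that interface and should be treated as assumptions, and `demo.jpg` is a placeholder image path.
```
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("demo.jpg").convert("RGB")  # placeholder query image

model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
texts = [txt_processors["eval"](t) for t in ["a cat on a sofa", "a busy street"]]

sample = {"image": image, "text_input": texts}
image_feat = model.extract_features(sample, mode="image")
text_feat = model.extract_features(sample, mode="text")

# Projected (assumed normalized) embeddings; index 0 is the [CLS] token.
sim = image_feat.image_embeds_proj[:, 0, :] @ text_feat.text_embeds_proj[:, 0, :].t()
print(sim)  # 1 x 2 scores used to rank the text gallery for this image query
```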
## Metrics
Common metrics are recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) within the top-k retrieved results.
We use TR to denote the image-to-text retrieval recall score and IR to denote the text-to-image retrieval recall score.
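Concretely, a query counts as a hit at recall@k if any of its top-k ranked gallery items is a ground-truth match. A minimal NumPy sketch, under the simplifying assumption that query `i` has exactly one correct gallery item at index `i` (the actual COCO evaluation tracks all five captions per image):
```
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    # sim[i, j] = similarity of query i and gallery item j; the ground-truth
    # match of query i is assumed to sit at gallery index i.
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar gallery items
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.random.rand(100, 100)  # toy similarity scores
print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```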
## Leaderboard
(Ranked by TR@1.)
| Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 4 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | [paper](https://arxiv.org/abs/2102.05918) |
| 5 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
| 6 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 7 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_coco.py
```
## References
"Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
![Samples from the Conceptual Captions dataset (Image credit: "https://ai.google.com/research/ConceptualCaptions/download").](imgs/conceptual_captions.png)
# Conceptual Captions Dataset
## Description
(from https://huggingface.co/datasets/conceptual_captions)
Conceptual Captions 3M (CC3M) is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
Conceptual Captions 12M (CC12M) is a dataset of 12 million image-text pairs specifically meant for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).
## Task
Image-language pre-training; image captioning.
## Auto-Downloading
**Warning**: images in this dataset are downloaded by requesting their URLs. Since URLs may disappear over time, the downloaded dataset is expected to be partial.
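To illustrate the best-effort nature of the download, a hypothetical helper like the one below would skip dead or slow URLs instead of failing; it is not the repository's actual download script (which lives under `DownloadConceptualCaptions`).
```
import requests

def try_download(url: str, path: str, timeout: float = 10.0) -> bool:
    # Best-effort fetch of a single image URL; returns False when the link is dead.
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # URL disappeared or timed out: skip this pair
    with open(path, "wb") as f:
        f.write(resp.content)
    return True
```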
### Conceptual Captions 3M
- Download images
```
cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc3m.py
```
- Create annotations by running the notebook
```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_3m.ipynb```
### Conceptual Captions 12M
- Download images
```
cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc12m.py
```
- Create annotations by running the notebook
```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_12m.ipynb```
## References
Edwin G. Ng, Bo Pang, Piyush Sharma and Radu Soricut. 2020. Understanding Guided Image Captioning Performance Across Domains. arXiv preprint arXiv:2012.02339.
![Samples from the DiDeMo dataset (Image credit: "https://www.di.ens.fr/~miech/datasetviz/").](imgs/didemo.png)
# DiDeMo Dataset (Retrieval)
## Description
[DiDeMo (Distinct Describable Moments)](https://github.com/LisaAnne/LocalizingMoments) contains over 10,000 unedited personal videos, each paired with natural-language descriptions that are localized in time, for roughly 40,000 video-text annotations in total.
## Task
Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
## Metrics
Common metrics are recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) within the top-k retrieved results.
We use TR to denote the video-to-text retrieval recall score and IR to denote the text-to-video retrieval recall score.
## Leaderboard
TBD
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_didemo.py
```
## References
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing Moments in Video with Natural Language." In Proceedings of the IEEE International Conference on Computer Vision, pp. 5803-5812. 2017.
![Samples from the Flickr30k dataset (Image credit: "https://bryanplummer.com/Flickr30kEntities/").](imgs/flickr30k.png)
# Flickr30K Dataset (Retrieval)
## Description
[Flickr30k](https://bryanplummer.com/Flickr30kEntities/) dataset contains 31k+ images collected from Flickr, together with 5 reference sentences provided by human annotators for each image.
## Task
Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
## Metrics
Common metrics are recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) within the top-k retrieved results.
We use TR to denote the image-to-text retrieval recall score and IR to denote the text-to-image retrieval recall score.
## Leaderboard
(Ranked by TR@1.)
| Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
| ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| 1 | BLIP | 97.2 | 99.9 | 100.0 | 87.5 | 97.7 | 98.9 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
| 2 | X-VLM | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
| 3 | ALBEF | 95.9 | 99.8 | 100.0 | 85.6 | 97.5 | 98.9 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
| 4 | ALIGN | 95.3 | 99.8 | 100.0 | 84.9 | 97.4 | 98.6 | [paper](https://arxiv.org/abs/2102.05918) |
| 5 | VILLA | 87.9 | 97.5 | 98.8 | 76.3 | 94.2 | 96.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
| 6 | UNITER | 87.3 | 98.0 | 99.2 | 75.6 | 94.1 | 96.8 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_flickr.py
```
## References
Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. "Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models." IJCV, 123(1):74-93, 2017.
![From https://arxiv.org/abs/1902.09506.](imgs/gqa.png)
# GQA Dataset
## Description
(from https://cs.stanford.edu/people/dorarad/gqa/about.html)
GQA is a VQA dataset of real-world images that requires visual, spatial, and compositional reasoning.
It consists of 22M questions and 110K images.
## Task
(from https://arxiv.org/abs/1902.09506)
Given an image and a question, the model is required to output a correct answer.
GQA questions require spatial understanding, multiple reasoning skills and multiple-step inference.
## Metrics
The metrics are accuracy, consistency, validity, and plausibility; the most commonly reported metric is accuracy.
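For reference, the commonly reported accuracy is plain exact-match between predicted and ground-truth answers; the sketch below assumes hypothetical dicts keyed by question id (the official evaluation script additionally computes consistency, validity, and plausibility):
```
def gqa_accuracy(predictions: dict, answers: dict) -> float:
    # predictions / answers: question_id -> answer string (illustrative format).
    correct = sum(
        predictions[qid].strip().lower() == ans.strip().lower()
        for qid, ans in answers.items()
    )
    return correct / len(answers)

print(gqa_accuracy({"q1": "red"}, {"q1": "Red"}))  # 1.0
```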
## Leaderboard
TBD
## Auto-Downloading
```
cd lavis/datasets/download_scripts && python download_gqa.py
```
## References
"GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", Drew A. Hudson, Christopher D. Manning